
1 Course Overview

2 Prediction

2.1 In Sample vs Out of Sample Errors

  • in sample error = the error from applying your prediction algorithm to the dataset you built it with
    • also known as resubstitution error
    • often optimistic (smaller than the error on a new sample), since the model may be tuned to the noise of that particular sample
  • out of sample error = the error from applying your prediction algorithm to a new dataset
    • also known as generalization error
    • out of sample error is the more important one, as it better reflects how the model will perform on new data
  • in sample error < out of sample error
    • the reason is over-fitting: the model is too adapted/optimized to the initial dataset
      • data have two parts: signal vs noise
      • the goal of a predictor (which should be simple/robust) is to find the signal and ignore the noise
      • it is possible to design a perfectly accurate in-sample predictor, but it will capture both signal and noise
      • such a predictor won’t perform as well on a new sample
    • it is often better to give up a little accuracy for more robustness when predicting on new data
  • Example
# load data
library(kernlab); data(spam); set.seed(333)
# picking a small subset (10 values) from spam data set
smallSpam <- spam[sample(dim(spam)[1],size=10),]
# label spam = 2 and ham = 1 (for coloring)
spamLabel <- (smallSpam$type=="spam")*1 + 1
# plot the capitalAve values (average run length of capital letters), colored by spam/ham (2 vs 1)
plot(smallSpam$capitalAve,col=spamLabel)

# first prediction rule (over-fitting to capture all variation)
rule1 <- function(x){
  prediction <- rep(NA,length(x))
  prediction[x > 2.7] <- "spam"
  prediction[x < 2.40] <- "nonspam"
  prediction[(x >= 2.40 & x <= 2.45)] <- "spam"
  prediction[(x > 2.45 & x <= 2.70)] <- "nonspam"
  return(prediction)
}
# tabulate results of prediction algorithm 1 (in sample error -> no error in this case)
table(rule1(smallSpam$capitalAve),smallSpam$type)
##          
##           nonspam spam
##   nonspam       5    0
##   spam          0    5
# second rule (simple, setting a threshold)
rule2 <- function(x){
  prediction <- rep(NA,length(x))
  prediction[x > 2.8] <- "spam"
  prediction[x <= 2.8] <- "nonspam"
  return(prediction)
}
# tabulate results of prediction algorithm 2 (in sample error -> 10% in this case)
table(rule2(smallSpam$capitalAve),smallSpam$type)
##          
##           nonspam spam
##   nonspam       5    1
##   spam          0    4
# tabulate out of sample error for algorithm 1
table(rule1(spam$capitalAve),spam$type)
##          
##           nonspam spam
##   nonspam    2141  588
##   spam        647 1225
# tabulate out of sample error for algorithm 2
table(rule2(spam$capitalAve),spam$type)
##          
##           nonspam spam
##   nonspam    2224  642
##   spam        564 1171
# accuracy and total correct for algorithm 1 and 2
rbind("Rule 1" = c(Accuracy = mean(rule1(spam$capitalAve)==spam$type),
"Total Correct" = sum(rule1(spam$capitalAve)==spam$type)),
"Rule 2" = c(Accuracy = mean(rule2(spam$capitalAve)==spam$type),
"Total Correct" = sum(rule2(spam$capitalAve)==spam$type)))
##         Accuracy Total Correct
## Rule 1 0.7315801          3366
## Rule 2 0.7378831          3395

3 Prediction Study Design

3.1 Sample Division Guidelines for Prediction Study Design

  • for large sample sizes
    • 60% training
    • 20% test
    • 20% validation
  • for medium sample sizes
    • 60% training
    • 40% test
    • no validation set to refine model (to ensure test set is of sufficient size)
  • for small sample sizes
    • carefully consider whether there are enough samples to build a prediction algorithm
    • no test/validation sets
    • perform cross validation
    • report caveat of small sample size and highlight the fact that the prediction algorithm has never been tested for out of sample error
  • there should always be a test/validation set that is held away and should NOT be looked at when building model
    • when complete, apply the model to the held-out set only one time
  • randomly sample training and test sets
    • for data collected over time, build the training set in chunks of time (consecutive samples)
  • datasets must reflect structure of problem
    • if prediction evolves with time, split train/test sets in time chunks (known as backtesting in finance)
  • subsets of data should reflect as much diversity as possible
    • random assignment does this
    • you can also try to balance by features (but this is tricky)
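  • Example: a minimal sketch of a 60/20/20 split (assumes the caret package and the kernlab spam data used elsewhere in these notes; variable names are illustrative)
# load packages and data
library(caret); library(kernlab); data(spam); set.seed(123)
# 60% of the data goes to the training set
inBuild <- createDataPartition(y=spam$type, p=0.6, list=FALSE)
training <- spam[inBuild,]
buildLeft <- spam[-inBuild,]
# split the remaining 40% evenly into test and validation sets
inTest <- createDataPartition(y=buildLeft$type, p=0.5, list=FALSE)
testing <- buildLeft[inTest,]
validation <- buildLeft[-inTest,]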

3.2 Picking the Right Data

  • to predict a variable/process \(X\), use data that is as closely related to \(X\) as possible, i.e. use like data to predict like
    • eg. movie “Moneyball”: to predict player performance, use data about player performance
    • eg. Netflix: to predict movie preferences, use data about movie preferences
  • but this is not a hard rule
  • in general, the looser the connection, the harder the prediction gets
  • weighting the data/variables by understanding and intuition can help to improve accuracy of prediction
  • data properties matter \(\rightarrow\) knowing how the data connects to what you are trying to predict
  • predicting on unrelated data is the most common mistake
    • if unrelated data must be used, be careful about interpreting the model as to why it works/doesn’t work

4 Types of Errors

4.1 Notable Measurements for Error – Binary Variables

  • accuracy = weights false positives/negatives equally
  • concordance = for multi-class cases, \[\kappa = \frac{accuracy - P(e)}{1 - P(e)}\] where \[P(e) = \frac{TP+FP}{total} \times \frac{TP+FN}{total} + \frac{TN+FN}{total} \times \frac{FP+TN}{total}\]

  • Example: Screening tests
    • given a disease with 0.1% prevalence in the population and a test kit that is 99% sensitive (most people with the disease test positive) and 99% specific (most people without the disease test negative), what is the probability that a person has the disease given a positive test result?
    • what about 10% prevalence (a high-risk population)?
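    • the arithmetic follows directly from Bayes’ rule; a small sketch (the ppv helper function is just for illustration)
# P(disease | positive test) from sensitivity, specificity, and prevalence
ppv <- function(sens, spec, prev) (sens*prev) / (sens*prev + (1-spec)*(1-prev))
# 0.1% prevalence -> only about 9% of positives actually have the disease
ppv(0.99, 0.99, 0.001)
# 10% prevalence (high-risk population) -> about 92%
ppv(0.99, 0.99, 0.10)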

4.2 Notable Measurements for Error – Continuous Variables

  • Mean squared error (MSE) = \[\frac{1}{n} \sum_{i=1}^n (Prediction_i - Truth_i)^2\]
  • Root mean squared error (RMSE) = \[\sqrt{\frac{1}{n} \sum_{i=1}^n(Prediction_i - Truth_i)^2}\]
    • in the same units as the variable
    • most commonly used error measure for continuous data
    • not an effective measure when there are outliers
      • one large value may significantly inflate the RMSE
  • Median absolute deviation = \[median(|Prediction_i - Truth_i|)\]
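  • a minimal sketch computing these measures in R (pred and truth are hypothetical vectors of predictions and true values)
# hypothetical predictions and true values
pred <- c(1.1, 2.3, 2.9, 4.2); truth <- c(1, 2, 3, 4)
# mean squared error, root mean squared error, and median absolute deviation
c(MSE = mean((pred-truth)^2),
  RMSE = sqrt(mean((pred-truth)^2)),
  MedAbsDev = median(abs(pred-truth)))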

4.3 Common error measures

  1. Mean squared error (or root mean squared error)
    • Continuous data, sensitive to outliers
  2. Median absolute deviation
    • Continuous data, often more robust
  3. Sensitivity (recall)
    • If you want few missed positives
  4. Specificity
    • If you want few negatives called positives
  5. Accuracy
    • Weights false positives/negatives equally
  6. Concordance
  7. Predictive value of a positive (precision)
    • When you are screening and prevalence is low

5 Receiver Operating Characteristic Curves

6 Cross Validation

6.1 Random Subsampling

  • a randomly sampled test set is subsetted out from the original training set
  • the predictor is built on the remaining training data and applied to the test set
  • this can be repeated several times (e.g. three random subsamplings from the same training set)
    • we can then average the estimated errors
  • considerations
    • must be done without replacement
    • random sampling with replacement = the bootstrap
      • underestimates the error: a sample can appear more than once, and if the predictor gets it right once it gets it right every time that sample appears
      • can be corrected with the 0.632 bootstrap, but the correction is complicated

6.2 K-Fold

  • break the training set into \(K\) subsets/folds (e.g. a 3-fold cross validation)
  • for each fold, build the model/predictor on the remaining training data and apply it to that held-out fold
  • repeat \(K\) times, once per fold, and average the errors (see the sketch after this list)
  • considerations
    • larger k = less bias, more variance
    • smaller k = more bias, less variance
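  • a minimal sketch of k-fold cross validation by hand (assumes the caret package and the kernlab spam data; the logistic regression on capitalAve is just for illustration)
# load packages and data
library(caret); library(kernlab); data(spam); set.seed(32323)
# returnTrain=TRUE gives the training indices for each of the k folds
folds <- createFolds(y=spam$type, k=10, list=TRUE, returnTrain=TRUE)
# fit on each fold's training portion, estimate error on the held-out portion
errs <- sapply(folds, function(trainIdx){
  fit <- glm(type ~ capitalAve, data=spam[trainIdx,], family="binomial")
  p <- predict(fit, newdata=spam[-trainIdx,], type="response")
  mean(ifelse(p > 0.5, "spam", "nonspam") != spam$type[-trainIdx])
})
# average the estimated errors across the folds
mean(errs)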

6.3 Leave One Out

  • leave out exactly one sample and build the predictor on the rest of the training data
  • predict the value for the left out sample
  • repeat for each sample (a minimal caret sketch follows)
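  • a minimal caret sketch of leave one out cross validation (assumes the faithful data used later in these notes; LOOCV refits the model once per observation, so it can be slow on large data)
# leave one out cross validation via trainControl
library(caret); data(faithful)
modFit <- train(eruptions ~ waiting, data=faithful, method="lm",
                trControl=trainControl(method="LOOCV"))
# the reported RMSE is the leave-one-out estimate of the out of sample error
modFit$results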

7 caret Package (Tutorial)

7.1 Data Slicing

  • createDataPartition(y=data$var, times=1, p=0.75, list=FALSE) \(\rightarrow\) creates data partitions using given variable
    • y=data$var = specifies what outcome/variable to split the data on
    • times=1 = specifies number of partitions to create (number of data splitting performed)
    • p=0.75 = percent of data that will be for training the model
    • list=FALSE = returns a matrix of indices corresponding to p% of the data (training set)
      • Note: matrix is easier to subset the data with, so list = FALSE is generally what is used
      • list=TRUE = returns a list of indices corresponding to p% of the data (training set)
    • the function effectively returns a list of indexes of the training set which can then be leveraged to subset the data
      • training<-data[inTrain, ] = subsets the data to training set only
      • testing<-data[-inTrain, ] = the rest of the data set can then be stored as the test set
    • Example
# load packages and data
library(caret); library(kernlab); data(spam)
# create training set indexes with 75% of data
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
# subset spam data to training
training <- spam[inTrain,]
# subset spam data (the rest) to test
testing <- spam[-inTrain,]
# dimension of original and training dataset
rbind("original dataset" = dim(spam),"training set" = dim(training))
##                  [,1] [,2]
## original dataset 4601   58
## training set     3451   58
  • createFolds(y=data$var, k=10, list=TRUE, returnTrain=TRUE) = slices the data into \(k\) folds for cross validation and returns \(k\) lists of indices
    • y=data$var = specifies what outcome/variable to split the data on
    • k=10 = specifies number of folds to create (See K Fold Cross Validation)
      • each training set contains approximately \(\frac{k-1}{k}\) of the data (90% in this case)
      • each test set contains approximately \(\frac{1}{k}\) of the data (10% in this case)
    • list=TRUE = returns k list of indices that corresponds to the cross-validated sets
      • Note: the returned list conveniently splits the data into k datasets/vectors of indices, so list=TRUE is generally what is used
      • when the returned object is a list (called folds in this case), you can use folds[[1]][1:10] to access different elements from that list
      • list=FALSE = returns a vector indicating which of the k folds each data point belongs to (i.e. 1 - 10 is assigned for each of the data points in this case)
        • Note: these group values correspond to test sets for each cross validation, which means everything else besides the marked points should be used for training
    • returnTrain=TRUE (only applies when list=TRUE) = returns the indices of the training sets
      • returnTrain=FALSE (default when unspecified) = returns the indices of the test sets
    • Example
# create 10 folds for cross validation and return the training set indices
folds <- createFolds(y=spam$type,k=10,list=TRUE,returnTrain=TRUE)
# structure of the training set indices
str(folds)
## List of 10
##  $ Fold01: int [1:4141] 1 2 3 4 6 7 8 9 10 12 ...
##  $ Fold02: int [1:4140] 1 2 3 4 5 7 8 9 10 11 ...
##  $ Fold03: int [1:4141] 1 3 4 5 6 7 8 9 10 11 ...
##  $ Fold04: int [1:4141] 2 3 4 5 6 7 8 9 10 11 ...
##  $ Fold05: int [1:4141] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Fold06: int [1:4141] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Fold07: int [1:4141] 1 2 3 4 5 6 8 9 11 12 ...
##  $ Fold08: int [1:4141] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Fold09: int [1:4141] 1 2 4 5 6 7 10 11 12 13 ...
##  $ Fold10: int [1:4141] 1 2 3 5 6 7 8 9 10 11 ...
# return the test set indices instead
# note: returnTrain = FALSE is unnecessary as it is the default behavior
folds.test <- createFolds(y=spam$type,k=10,list=TRUE,returnTrain=FALSE)
str(folds.test)
## List of 10
##  $ Fold01: int [1:460] 15 16 18 40 45 62 68 81 82 102 ...
##  $ Fold02: int [1:459] 1 41 55 58 67 75 117 123 151 175 ...
##  $ Fold03: int [1:461] 3 14 66 69 70 80 90 112 115 135 ...
##  $ Fold04: int [1:460] 5 19 25 65 71 83 85 88 91 93 ...
##  $ Fold05: int [1:460] 6 10 17 21 26 56 57 104 107 116 ...
##  $ Fold06: int [1:459] 7 8 13 39 52 54 76 89 99 106 ...
##  $ Fold07: int [1:461] 4 23 27 29 32 33 34 38 49 51 ...
##  $ Fold08: int [1:460] 2 9 30 31 36 37 43 46 47 48 ...
##  $ Fold09: int [1:461] 12 20 24 44 53 59 60 64 84 98 ...
##  $ Fold10: int [1:460] 11 22 28 35 42 61 72 86 92 118 ...
# return first 10 elements of the first training set
folds[[1]][1:10]
##  [1]  1  2  3  4  6  7  8  9 10 12
  • createResample(y=data$var, times=10, list=TRUE) = create 10 resamplings from the given data with replacement
    • list=TRUE = returns list of n vectors that contain indices of the sample
      • Note: each vector has the same length as the data and contains the resampled row indices
    • times=10 = number of samples to create
# create 10 resamples
resamples <- createResample(y=spam$type,times=10,list=TRUE)
# structure of the resamples (note some samples are repeated)
str(resamples)
## List of 10
##  $ Resample01: int [1:4601] 1 4 4 4 7 8 12 13 13 14 ...
##  $ Resample02: int [1:4601] 3 3 5 7 10 12 12 13 13 14 ...
##  $ Resample03: int [1:4601] 1 2 2 3 4 5 8 10 11 12 ...
##  $ Resample04: int [1:4601] 1 3 3 4 7 8 8 9 10 14 ...
##  $ Resample05: int [1:4601] 2 4 5 6 7 7 8 8 9 12 ...
##  $ Resample06: int [1:4601] 3 6 6 7 8 9 12 13 13 14 ...
##  $ Resample07: int [1:4601] 1 2 2 5 5 6 7 8 9 10 ...
##  $ Resample08: int [1:4601] 2 2 3 4 4 7 7 8 8 9 ...
##  $ Resample09: int [1:4601] 1 4 7 8 8 9 12 13 15 15 ...
##  $ Resample10: int [1:4601] 1 3 4 4 7 7 9 9 10 11 ...
  • createTimeSlices(y=data, initialWindow=20, horizon=10) = creates training sets with specified window length and the corresponding test sets
    • initialWindow=20 = number of consecutive values in each time slice/training set (i.e. values 1 - 20)
    • horizon=10 = number of consecutive values in each predict/test set (i.e. values 21 - 30)
    • fixedWindow=FALSE = training sets always start at the first observation
      • this means that the first training set would be 1 - 20, the second will be 1 - 21, third 1 - 22, etc.
      • but the test sets are still like before (21 - 30, 22 - 31, etc.)
# create time series data
tme <- 1:1000
# create time slices
folds <- createTimeSlices(y=tme,initialWindow=20,horizon=10)
# name of lists
names(folds)
## [1] "train" "test"
# first training set
folds$train[[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# first test set
folds$test[[1]]
##  [1] 21 22 23 24 25 26 27 28 29 30

7.2 Training Options (tutorial)

  • train(y ~ x, data=df, method="glm") = function to apply the machine learning algorithm to construct model from training data
# returns the arguments of the default train function
args(caret:::train.default)
## function (x, y, method = "rf", preProcess = NULL, ..., weights = NULL, 
##     metric = ifelse(is.factor(y), "Accuracy", "RMSE"), maximize = ifelse(metric %in% 
##         c("RMSE", "logLoss", "MAE"), FALSE, TRUE), trControl = trainControl(), 
##     tuneGrid = NULL, tuneLength = ifelse(trControl$method == 
##         "none", 1, 3)) 
## NULL
  • train function has a large set of parameters, below are the default options
    • method="rf" = default algorithm is random forest for training a given data set; caret contains a large number of algorithms
      • names(getModelInfo()) = returns all the options for method argument
      • list of models and their information can be found here
    • preProcess=NULL = set preprocess options (see Preprocessing)
    • weights=NULL = can be used to add weights to observations, useful for unbalanced distribution (a lot more of one type than another)
    • metric=ifelse(is.factor(y), "Accuracy", "RMSE") = default metric for algorithm is Accuracy for factor variables, and RMSE, or root mean squared error, for continuous variables
      • Continuous Outcomes:
        • RMSE = root mean squared error
        • RSquared = \(R^2\) from regression models (mainly useful for linear models)
      • Categorical Outcomes:
        • Accuracy = fraction of observations classified correctly
        • Kappa = a measure of concordance (see Types of Errors)
    • maximize=ifelse(metric %in% c("RMSE","logLoss","MAE"), FALSE, TRUE) = whether the metric should be maximized (e.g. Accuracy) or minimized (e.g. RMSE)
    • trControl=trainControl() = training controls for the model, more details below
    • tuneGrid=NULL = a data frame of tuning parameter combinations can be supplied to override the default grid
    • tuneLength=3 = number of levels to evaluate for each tuning parameter when tuneGrid is not supplied
# returns the default arguments for the trainControl object
args(trainControl)
## function (method = "boot", number = ifelse(grepl("cv", method), 
##     10, 25), repeats = ifelse(grepl("[d_]cv$", method), 1, NA), 
##     p = 0.75, search = "grid", initialWindow = NULL, horizon = 1, 
##     fixedWindow = TRUE, skip = 0, verboseIter = FALSE, returnData = TRUE, 
##     returnResamp = "final", savePredictions = FALSE, classProbs = FALSE, 
##     summaryFunction = defaultSummary, selectionFunction = "best", 
##     preProcOptions = list(thresh = 0.95, ICAcomp = 3, k = 5, 
##         freqCut = 95/5, uniqueCut = 10, cutoff = 0.9), sampling = NULL, 
##     index = NULL, indexOut = NULL, indexFinal = NULL, timingSamps = 0, 
##     predictionBounds = rep(FALSE, 2), seeds = NA, adaptive = list(min = 5, 
##         alpha = 0.05, method = "gls", complete = TRUE), trim = FALSE, 
##     allowParallel = TRUE) 
## NULL
  • trainControl creates an object that sets many options for how the model will be applied to the training data
    • Note: the default values are listed below; the parameters can be set at your discretion
    • method="boot" = for controlling the resampling
      • "boot" = bootstrapping (drawing with replacement)
      • "boot632" = bootstrapping with adjustment
      • "cv" = cross validation
      • "repeatedcv" = repeated cross validation
      • "LOOCV" = leave one out cross validation
    • number=ifelse(grepl("cv", method),10, 25) = number of subsamples to take (for boot/cross validation)
      • number=10 = default for any kind of cross validation
      • number=25 = default for bootstrapping
      • Note: number should be increased when fine-tuning a model with a large number of parameters
    • repeats=ifelse(grepl("[d_]cv$", method), 1, NA) = number of times to repeat the resampling (applies to repeated cross validation)
      • repeats=1 = default for cross validation methods
      • repeats is not applicable (NA) for bootstrapping; the number of bootstrap resamples is set by number
      • Note: a large value can slow things down considerably
    • p=0.75 = default percentage of data to create training sets
    • initialWindow=NULL, horizon=1, fixedWindow=TRUE = parameters for time series data
    • verboseIter=FALSE = whether to print a training log as iterations proceed
    • returnData=TRUE = return the training data in the fitted object; returnResamp="final" = return resampled performance measures for the final model only
    • savePredictions=FALSE = save the hold-out predictions for each resample
    • classProbs=FALSE = return class probabilities along with the predictions
    • summaryFunction=defaultSummary = function used to compute performance summaries for the model
    • preProcOptions=list(thresh = 0.95, ICAcomp = 3, k = 5) = options passed to preProcess when preprocessing is requested
    • predictionBounds=rep(FALSE, 2) = specify the range of the predicted value
      • for numeric predictions, predictionBounds=c(10, NA) would mean that any value lower than 10 is treated as 10, with no upper bound
    • seeds=NA = set the seed for the operation
      • often useful to set an overall seed, but can also set a seed for each resample (this is useful for parallel fits)
      • Note: setting this is important when you want to reproduce the same results when the train function is run
    • allowParallel=TRUE = sets for parallel processing/computations
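  • a minimal sketch tying train and trainControl together (assumes the spam training set created in the data slicing example; the resampling choices are illustrative)
# 10-fold cross validation, repeated 3 times
fitControl <- trainControl(method="repeatedcv", number=10, repeats=3)
# pass the control object to train
set.seed(825)
modFit <- train(type ~ ., data=training, method="glm", trControl=fitControl)
modFit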

7.3 Plotting Predictors (tutorial)

  • it is important to only plot the data in the training set
    • using the test data may lead to over-fitting (model should not be adjusted to test set)
  • goal of producing these exploratory plots = look for potential outliers, skewness, imbalances in outcome/predictors, and explainable groups of points/patterns
  • featurePlot(x=predictors, y=outcome, plot="pairs") = short cut to plot the relationships between the predictors and outcomes
# load relevant libraries
library(ISLR); library(ggplot2); library(caret)
# load wage data
data(Wage)
# create training and test sets
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
# plot relationships between the predictors and outcome
featurePlot(x=training[,c("age","education","jobclass")], y = training$wage,plot="pairs")

  • qplot(age, wage, color=education, data=training) = can be used to separate data by a factor variable (by coloring the points differently)
    • geom_smooth(method = "lm") = adds a regression line to the plots
    • geom=c("boxplot", "jitter") = specifies what kind of plot to produce, in this case both the boxplot and the point cloud
# qplot plus linear regression lines
qplot(age,wage,colour=education,data=training)+geom_smooth(method='lm',formula=y~x)

  • cut2(variable, g=3) = creates a new factor variable by cutting the specified variable into n groups (3 in this case) based on percentiles
    • Note: cut2 function is part of the Hmisc package, so library(Hmisc) must be run first
    • this variable can then be used to tabulate/plot the data
  • grid.arrange(p1, p2, ncol=2) = arranges multiple ggplot2 graphs on the same plot
    • Note: grid.arrange function is part of the gridExtra package, so library(gridExtra) must be run first
# load Hmisc and gridExtra packages
library(Hmisc);library(gridExtra);
# cut the wage variable into 3 groups
cutWage <- cut2(training$wage,g=3)
table(cutWage)
## cutWage
## [ 20.1, 92.7) [ 92.7,119.1) [119.1,318.3] 
##           701           723           678
# plot the boxplot
p1 <- qplot(cutWage,age, data=training,fill=cutWage,
geom=c("boxplot"))
# plot boxplot and point clusters
p2 <- qplot(cutWage,age, data=training,fill=cutWage,
geom=c("boxplot","jitter"))
# plot the two graphs side by side
grid.arrange(p1,p2,ncol=2)

  • table(cutVariable, data$var2) = tabulates the cut factor variable vs another variable in the dataset (i.e. builds a contingency table using cross-classifying factors)
  • prop.table(table, margin=1) = converts a table to a proportion table
    • margin=1 = calculate the proportions based on the rows
    • margin=2 = calculate the proportions based on the columns
# tabulate the cutWage and jobclass variables
t <- table(cutWage,training$jobclass)
# print table
t
##                
## cutWage         1. Industrial 2. Information
##   [ 20.1, 92.7)           445            256
##   [ 92.7,119.1)           376            347
##   [119.1,318.3]           269            409
# convert to proportion table based on the rows
prop.table(t,1)
##                
## cutWage         1. Industrial 2. Information
##   [ 20.1, 92.7)     0.6348074      0.3651926
##   [ 92.7,119.1)     0.5200553      0.4799447
##   [119.1,318.3]     0.3967552      0.6032448
  • qplot(var1, color=var2, data=training, geom="density") = produces density plot for the given numeric and factor variables
    • effectively smoothed out histograms
    • provides for easy overlaying of groups of data
      • break different variables up by group and see how outcomes change between groups
# produce density plot
qplot(wage,colour=education,data=training,geom="density")

  • Notes and Comments
    • Make your plots only in the training set
      • Don’t use the test set for exploration!
    • Things you should be looking for
      • Imbalance in outcomes/predictors
      • Outliers
      • Groups of points not explained by a predictor
      • Skewed variables
        • perhaps transform them to make them more normally distributed for regression models
        • matters not so much when using machine learning

7.4 Preprocessing (tutorial)

  • some predictors may have strange distributions (i.e. skewed) and may need to be transformed to be more useful for prediction algorithm
    • particularly true for model based algorithms \(\rightarrow\) naive Bayes, linear discriminant analysis, linear regression
  • centering = subtracting the observations of a particular variable by its mean
  • scaling = dividing the observations of a particular variable by its standard deviation
  • normalizing = centering and scaling the variable \(\rightarrow\) effectively converting each observation to the number of standard deviations away from the mean
    • the distribution of the normalized variable will have a mean of 0 and standard deviation of 1
    • Note: normalizing data can help remove bias and high variability, but may not be applicable in all cases
  • Note: if a predictor/variable is standardized when training the model, the same transformation must be applied to the test set using the mean and standard deviation of the training variables
    • this means that the mean and standard deviation of the normalized test variable will NOT be exactly 0 and 1, respectively, but will be close
    • the test-set transformations will likely be imperfect, but the test/train sets must be processed in the same way
  • train(y~x, data=training, preProcess=c("center", "scale")) = preprocessing can be directly specified in the train function
    • preProcess=c("center", "scale") = normalize all predictors before constructing model
  • preProcess(trainingData, method=c("center", "scale")) = function in the caret package to standardize data
    • you can store the result of the preProcess function as an object and apply it to the train and test sets using the predict function
# load spam data
library(caret); library(kernlab); data(spam)
# create train and test sets
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
# create a histogram
hist(training$capitalAve,main="",xlab="ave. capital run length")

# mean and sd
mean(training$capitalAve)
## [1] 4.688791
sd(training$capitalAve)
## [1] 26.64678
  • almost all of the capital run lengths are very small, but a few are much larger; this is an example of a heavily skewed variable, which is hard for model based predictors to deal with, so preprocessing may be needed
  • the standard deviation is also huge (much larger than the mean), so the variable is highly variable; preprocessing keeps machine learning algorithms from being tricked by the skew and high variability
# create preProcess object for all predictors ("-58" because 58th = outcome)
preObj <- preProcess(training[,-58],method=c("center","scale"))
# normalize training set
trainCapAveS <- predict(preObj,training[,-58])$capitalAve
# normalize test set using training parameters
testCapAveS <- predict(preObj,testing[,-58])$capitalAve
# compare results for capitalAve variable
rbind(train = c(mean = mean(trainCapAveS), std = sd(trainCapAveS)),
test = c(mean(testCapAveS), sd(testCapAveS)))
##               mean      std
## train 6.097035e-18 1.000000
## test  7.548133e-02 1.633866
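  • the same standardization can be requested directly in the train call (a minimal sketch, using the training set created above)
# center and scale all predictors as part of model training
modelFit <- train(type ~ ., data=training, preProcess=c("center","scale"), method="glm")
modelFit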
  • preprocess(data, method="BoxCox") = applies Box-Cox transformations to continuous data to help normalize the variables through maximum likelihood
    • Note: note it assumes continuous values and DOES NOT deal with repeated values
    • qqnorm(processedVar) = can be used to produce the Q-Q plot which compares the theoretical quantiles with the sample quantiles to see the normality of the data
# set up BoxCox transforms
preObj <- preProcess(training[,-58],method=c("BoxCox"))
# perform preprocessing on training data
trainCapAveS <- predict(preObj,training[,-58])$capitalAve
# plot histogram and QQ Plot
# Note: the transformation definitely helped to
# normalize the data but it does not produce perfect result
par(mfrow=c(1,2)); hist(trainCapAveS); qqnorm(trainCapAveS)

  • preProcess(data, method="knnImpute") = impute/estimate the missing data using k nearest neighbors (knn) imputation
    • knnImpute = takes the k nearest neighbors from the missing value and averages the value to impute the missing observations
    • Note: most prediction algorithms are not built to handle missing data
# Make some values NA
training$capAve <- training$capitalAve
selectNA <- rbinom(dim(training)[1],size=1,prob=0.05)==1
training$capAve[selectNA] <- NA
# Impute and standardize
preObj <- preProcess(training[,-58],method="knnImpute")
capAve <- predict(preObj,training[,-58])$capAve
# Standardize true values
capAveTruth <- training$capitalAve
capAveTruth <- (capAveTruth-mean(capAveTruth))/sd(capAveTruth)
# compute differences between imputed values and true values
# to see how close those two values are to each other
quantile(capAve - capAveTruth)
##            0%           25%           50%           75%          100% 
## -1.656344e+00  2.377772e-05  1.286900e-03  1.880821e-03  3.174413e-01
quantile((capAve - capAveTruth)[selectNA])
##           0%          25%          50%          75%         100% 
## -1.656344036 -0.009489595  0.001683679  0.015559636  0.317441284
quantile((capAve - capAveTruth)[!selectNA])
##            0%           25%           50%           75%          100% 
## -8.465254e-01  9.822581e-05  1.284404e-03  1.850043e-03  2.349137e-03
  • k-nearest neighbors imputation finds the k (e.g. 10) data vectors that look most like the vector with the missing value, and averages that variable’s values across those neighbors to fill in the missing position

  • Notes and Comments
    • Training and test must be processed in the same way
    • This means test transformations will likely be imperfect
      • Especially if the test/training sets are collected at different times
    • Careful when transforming factor variables (as compared to continuous variables)!!!
      • (all of the transformations mentioned above other than imputation are based on continuous variables)

8 Covariate Creation/Feature Extraction

8.1 Creating Dummy Variables

  • convert factor variables to indicator/dummy variable \(\rightarrow\) qualitative become quantitative
  • dummyVars(outcome~var, data=training) = creates a dummy variable object that can be used through predict function to create dummy variables
    • predict(dummyObj, newdata=training) = creates appropriate columns to represent the factor variable with appropriate 0s and 1s
      • 2 factor variable \(\rightarrow\) two columns which have 0 or 1 depending on the outcome
      • 3 factor variable \(\rightarrow\) three columns which have 0, 0, and 1 representing the outcome
      • Note: only one of the columns can have values of 1 for each observation
library(ISLR); library(caret); data(Wage);
# setting up data
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]; testing <- Wage[-inTrain,]
# create a dummy variable object
dummies <- dummyVars(wage ~ jobclass,data=training)
# create the dummy variable columns
head(predict(dummies,newdata=training))
##        jobclass.1. Industrial jobclass.2. Information
## 231655                      1                       0
## 86582                       0                       1
## 161300                      1                       0
## 155159                      0                       1
## 11443                       0                       1
## 376662                      0                       1

8.2 Removing Zero Covariates

  • some variables have no variability at all (eg. variable indicating if an email contains letters)
  • these variables are not useful when we want to construct a prediction model
  • nearZeroVar(training, saveMetrics=TRUE) = returns list of variables in training data set with information on frequency ratios, percent uniques, whether or not it has zero variance
    • freqRatio = ratio of frequencies for the most common value over second most common value
    • percentUnique = percentage of unique data points out of total number of data points
    • zeroVar = TRUE/FALSE indicating whether the predictor has only one distinct value
    • nzv = TRUE/FALSE indicating whether the predictor is a near zero variance predictor
    • Note: when nzv = TRUE, those variables should be thrown out
# print nearZeroVar table
nearZeroVar(training,saveMetrics=TRUE)
##            freqRatio percentUnique zeroVar   nzv
## year        1.017647    0.33301618   FALSE FALSE
## age         1.231884    2.85442436   FALSE FALSE
## maritl      3.329571    0.23786870   FALSE FALSE
## race        8.480583    0.19029496   FALSE FALSE
## education   1.393750    0.23786870   FALSE FALSE
## region      0.000000    0.04757374    TRUE  TRUE
## jobclass    1.070936    0.09514748   FALSE FALSE
## health      2.526846    0.09514748   FALSE FALSE
## health_ins  2.209160    0.09514748   FALSE FALSE
## logwage     1.011765   18.83920076   FALSE FALSE
## wage        1.011765   18.83920076   FALSE FALSE

8.3 Creating Splines (Polynomial Functions)

  • when you want to fit curves through the data, basis functions can be leveraged
  • [splines package] bs(data$var, df=3) = creates 3 new basis columns corresponding (roughly) to \(var\), \(var^2\), and \(var^3\) terms
  • ns() and poly() can also be used to generate polynomials
  • the gam method in caret also allows smoothing of multiple variables, with a different amount of smoothing for each variable
  • Note: the same polynomial operations must be performed on the test set using the predict function
# load splines package
library(splines)
# create polynomial function
bsBasis <- bs(training$age,df=3)
head(bsBasis) # (values are scaled for computational purposes)
##              1          2           3
## [1,] 0.0000000 0.00000000 0.000000000
## [2,] 0.2368501 0.02537679 0.000906314
## [3,] 0.4163380 0.32117502 0.082587862
## [4,] 0.4308138 0.29109043 0.065560908
## [5,] 0.3625256 0.38669397 0.137491189
## [6,] 0.3063341 0.42415495 0.195763821
# fit the outcome on the three polynomial terms
lm1 <- lm(wage ~ bsBasis,data=training)
# plot all age vs wage data
plot(training$age,training$wage,pch=19,cex=0.5)
# plot the fitted polynomial function
points(training$age,predict(lm1,newdata=training),col="red",pch=19,cex=0.5)

# predict function on test values
head(predict(bsBasis,age=testing$age))
##              1          2           3
## [1,] 0.0000000 0.00000000 0.000000000
## [2,] 0.2368501 0.02537679 0.000906314
## [3,] 0.4163380 0.32117502 0.082587862
## [4,] 0.4308138 0.29109043 0.065560908
## [5,] 0.3625256 0.38669397 0.137491189
## [6,] 0.3063341 0.42415495 0.195763821

9 Preprocessing with Principal Component Analysis (PCA)

9.1 prcomp Function

  • pr<-prcomp(data) = performs PCA on all variables and returns a prcomp object that contains information about standard deviations and rotations
    • pr$rotation = returns the eigenvectors (the coefficients each variable is multiplied by to form the principal components) \(\rightarrow\) how the principal components are created
    • it is often useful to take the log transformation of the variables and add 1 before performing PCA
      • helps to reduce skewness or strange distribution in data
      • log(0) = - infinity, so we add 1 to account for zero values
      • makes data more Gaussian
    • plot(pr) = plots the percent variation explained by the first 10 principal components (PC)
      • can be used to find the PCs that represent the most variation
# load spam data
library(caret); library(kernlab); data(spam)
# create train and test sets
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
# correlated predictors
M <- abs(cor(training[,-58]))
diag(M) <- 0
which(M > 0.8, arr.ind=T)
##        row col
## num415  34  32
## direct  40  32
## num857  32  34
## direct  40  34
## num857  32  40
## num415  34  40
# plot for example num857 and num415
# frequencies of the two variables are highly correlated
plot(spam[,34],spam[,32])

# we could rotate the plot
# X = 0.71*num415 + 0.71*num857
# Y = 0.71*num415 - 0.71*num857
X <- 0.71*training$num415 + 0.71*training$num857
Y <- 0.71*training$num415 - 0.71*training$num857
# perform PCA on num415 and num857
smallSpam <- spam[,c(34,32)]
prComp <- prcomp(smallSpam)
# plot rotation and first two PCs
par(mfrow = c(1,2))
plot(X,Y) # most of the variability is happening on the x-axis, clustered on zero
plot(prComp$x[,1],prComp$x[,2])

prComp$rotation
##              PC1        PC2
## num415 0.7080625  0.7061498
## num857 0.7061498 -0.7080625
# perform PCA on dataset
# taking the log10 and adding 1 helps reduce skewness or strange distributions in the data
prComp <- prcomp(log10(spam[,-58]+1))
# print out the eigenvector/rotations first 5 rows and PCs
# eg. PC1 is made up by a linear combination of the variables (column 1)
head(prComp$rotation[, 1:5], 5)
##                 PC1           PC2         PC3         PC4          PC5
## make    0.019370409  0.0427855959 -0.01631961  0.02798232 -0.014903314
## address 0.010827343  0.0408943785  0.07074906 -0.01407049  0.037237531
## all     0.040923168  0.0825569578 -0.03603222  0.04563653  0.001222215
## num3d   0.006486834 -0.0001333549  0.01234374 -0.01005991 -0.001282330
## our     0.036963221  0.0941456085 -0.01871090  0.05098463 -0.010582039
# create new variable that marks spam as 2 and nospam as 1
typeColor <- ((spam$type=="spam")*1 + 1)
# plot the first two principal components
par(mfrow = c(1,1))
plot(prComp$x[,1],prComp$x[,2],col=typeColor,xlab="PC1",ylab="PC2")

9.2 caret Package

  • pp<-preProcess(log10(training[,-58]+1), method="pca", pcaComp=2, thresh=0.8) = performs PCA with the preProcess function (with thresh, it keeps as many principal components as are needed to capture that share of the variation)
    • creates a preProcess object that can be applied using the predict function
    • pcaComp=2 = specifies the number of principal components to compute (2 in this case); when set, it overrides thresh
    • thresh=0.8 = threshold for the variation to be captured by the principal components
      • thresh=0.95 = default value, which keeps as many principal components as are needed to capture 95% of the variation in the data
  • predict(pp, training) = computes new variables for the PCs (2 in this case) for the training data set
    • the results from predict can then be used as data for the prediction model
    • Note: the same PCA must be performed on the test set
# create train and test sets
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
# create preprocess object
preProc <- preProcess(log10(training[,-58]+1),method="pca",pcaComp=2)
# calculate PCs for training data
trainPC <- predict(preProc,log10(training[,-58]+1))
# plot the two principal components
typeColor <- ((spam$type=="spam")*1 + 1)
plot(trainPC[,1],trainPC[,2],col=typeColor)

# run model on outcome and principal components
# OLDER VERSION modelFit <- train(training$type ~ .,method="glm",data=trainPC)
modelFit <- train(x = trainPC, y = training$type, method="glm")
# calculate PCs for test data
testPC <- predict(preProc,log10(testing[,-58]+1))
# compare results
confusionMatrix(testing$type,predict(modelFit,testPC))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     650   47
##    spam         74  379
##                                           
##                Accuracy : 0.8948          
##                  95% CI : (0.8756, 0.9119)
##     No Information Rate : 0.6296          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7773          
##  Mcnemar's Test P-Value : 0.0181          
##                                           
##             Sensitivity : 0.8978          
##             Specificity : 0.8897          
##          Pos Pred Value : 0.9326          
##          Neg Pred Value : 0.8366          
##              Prevalence : 0.6296          
##          Detection Rate : 0.5652          
##    Detection Prevalence : 0.6061          
##       Balanced Accuracy : 0.8937          
##                                           
##        'Positive' Class : nonspam         
## 
  • alternatively, PCA can be directly performed with the train method
    • train(outcome ~ ., method="glm", preProcess="pca", data=training) = performs PCA first on the training set and then runs the specified model
      • effectively the same procedures as above (preProcess \(\rightarrow\) predict)
  • Note: in both cases, the PCs were able to achieve 90+% accuracy
# construct model
modelFit <- train(type ~ ., method="glm", preProcess="pca", data=training)
# print results of model
confusionMatrix(testing$type,predict(modelFit,testing))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     661   36
##    spam         55  398
##                                           
##                Accuracy : 0.9209          
##                  95% CI : (0.9037, 0.9358)
##     No Information Rate : 0.6226          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8331          
##  Mcnemar's Test P-Value : 0.05917         
##                                           
##             Sensitivity : 0.9232          
##             Specificity : 0.9171          
##          Pos Pred Value : 0.9484          
##          Neg Pred Value : 0.8786          
##              Prevalence : 0.6226          
##          Detection Rate : 0.5748          
##    Detection Prevalence : 0.6061          
##       Balanced Accuracy : 0.9201          
##                                           
##        'Positive' Class : nonspam         
## 

10 Predicting with Regression

10.1 Example - Predicting with Regression

  • lm<-lm(y ~ x, data=train) = runs a linear model of outcome y on predictor x \(\rightarrow\) univariate regression
    • summary(lm) = returns summary of the linear regression model, which will include coefficients, standard errors, \(t\) statistics, and p values
    • lm(y ~ x1+x2+x3, data=train) = run linear model of outcome y on predictors x1, x2, and x3
    • lm(y ~ ., data=train) = run linear model of outcome y on all predictors
  • predict(lm, newdata=df) = use the constructed linear model to predict outcomes (\(\hat Y_i\)) for the new values
    • newdata data frame must have the same variables (factors must have the same levels) as the training data
    • newdata=test = predict outcomes for the test set based on linear regression model from the training
    • Note: the regression line will not be a perfect fit on the test set since it was constructed on the training set
  • RMSE can be calculated to measure the accuracy of the linear model
    • Note: \(RMSE_{test}\), which estimates the out-of-sample error, is almost always GREATER than \(RMSE_{train}\)
# load data
library(caret); data(faithful); set.seed(333)
# create train and test sets
inTrain <- createDataPartition(y=faithful$waiting, p=0.5, list=FALSE)
trainFaith <- faithful[inTrain,]; testFaith <- faithful[-inTrain,]
# explore data
head(trainFaith)
##   eruptions waiting
## 1     3.600      79
## 3     3.333      74
## 5     4.533      85
## 6     2.883      55
## 7     4.700      88
## 8     3.600      85
plot(trainFaith$waiting, trainFaith$eruptions, pch=19, col="blue", xlab="Waiting", ylab="Duration")

  • Fit a linear model: \(ED_i = b_0 + b_1 WT_i + e_i\)
# build linear model
lm1 <- lm(eruptions ~ waiting,data=trainFaith)
# print summary of linear model
summary(lm1)
## 
## Call:
## lm(formula = eruptions ~ waiting, data = trainFaith)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.26990 -0.34789  0.03979  0.36589  1.05020 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.792739   0.227869  -7.867 1.04e-12 ***
## waiting      0.073901   0.003148  23.474  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.495 on 135 degrees of freedom
## Multiple R-squared:  0.8032, Adjusted R-squared:  0.8018 
## F-statistic:   551 on 1 and 135 DF,  p-value: < 2.2e-16
# predict eruptions for new waiting time
newdata <- data.frame(waiting=80)
predict(lm1,newdata)
##        1 
## 4.119307
# create 1 x 2 panel plot
par(mfrow=c(1,2))
# plot train data with the regression line
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",xlab="Waiting",
ylab="Duration", main = "Train")
lines(trainFaith$waiting,predict(lm1),lwd=3)
# plot test data with the regression line
plot(testFaith$waiting,testFaith$eruptions,pch=19,col="blue",xlab="Waiting",
ylab="Duration", main = "Test")
lines(testFaith$waiting,predict(lm1,newdata=testFaith),lwd=3)

# Calculate RMSE on training and test sets
anova.table <-anova(lm1)
trainRMSE.anova <- sqrt(anova.table$"Mean Sq"[2])
c(trainRMSE.anova = trainRMSE.anova,
trainRMSE = sqrt(sum((lm1$fitted-trainFaith$eruptions)^2/(nrow(trainFaith) - 2))),
testRMSE = sqrt(sum((predict(lm1,newdata=testFaith)-testFaith$eruptions)^2/(nrow(testFaith) - 2))))
## trainRMSE.anova       trainRMSE        testRMSE 
##       0.4950413       0.4950413       0.5062673
  • pi<-predict(lm, newdata=test, interval="prediction") = returns 3 columns for fit (predicted value, same as before), lwr (lower bound of prediction interval), and upr (upper bound of prediction interval)
    • matlines(x, pi, type="l") = plots three lines, one for the linear fit and two for upper/lower prediction interval bounds
# calculate prediction intervals
pred1 <- predict(lm1,newdata=testFaith,interval="prediction")
# plot data points (eruptions, waiting)
plot(testFaith$waiting,testFaith$eruptions,pch=19,col="blue")
# plot fit line and prediction interval
matlines(testFaith$waiting,pred1,type="l",col=c(1,2,2),lty=c(1,1,1),lwd=3)

  • lm <- train(y ~ x, method="lm", data=train) = run linear model on the training data \(\rightarrow\) identical to lm function
    • summary(lm$finalModel) = returns summary of the linear regression model, which will include coefficients, standard errors, \(t\) statistics, and p values \(\rightarrow\) identical to summary(lm) for a lm object
    • train(y ~ ., method="lm", data=train) = run linear model on all predictors in training data
      • multiple predictors (dummy/indicator variables) are created for factor variables
    • plot(lm$finalModel) = construct 4 diagnostic plots for evaluating the model
      • Note: more information on these plots can be found at ?plot.lm
      • Residuals vs Fitted
      • Normal Q-Q
      • Scale-Location
      • Residuals vs Leverage
# same process with caret package
modFit <- train(eruptions ~ waiting, data=trainFaith, method="lm")
summary(modFit$finalModel)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.26990 -0.34789  0.03979  0.36589  1.05020 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.792739   0.227869  -7.867 1.04e-12 ***
## waiting      0.073901   0.003148  23.474  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.495 on 135 degrees of freedom
## Multiple R-squared:  0.8032, Adjusted R-squared:  0.8018 
## F-statistic:   551 on 1 and 135 DF,  p-value: < 2.2e-16

10.2 Example - Predicting with Regression and Multiple Covariates

  • Fit a linear model: \(Wage_i = b_0 + b_1 age_i + b_2 I(Jobclass_i="Information") + \sum_{k=1}^4 \gamma_k I(education_i=level_k) + e_i\)
# load data
library(ISLR); library(ggplot2); library(caret);
data(Wage)
# create train and test sets
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]; testing <- Wage[-inTrain,]
# fit linear model for age jobclass and education
modFit<- train(wage ~ age + jobclass + education,method = "lm",data=training)
# store final model
finMod <- modFit$finalModel
# summary of modFit
print(modFit)
## Linear Regression 
## 
## 2102 samples
##    3 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   35.79066  0.248127  24.68782
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# set up 2 x 2 panel plot
par(mfrow = c(2, 2))
# construct diagnostic plots for model
plot(finMod,pch=19,cex=0.5,col="#00000010")

  • plotting residuals by fitted values and coloring with a variable not used in the model helps spot a trend in that variable.
# plot fitted values by residuals
qplot(finMod$fitted, finMod$residuals, color=race, data=training)

  • plotting residuals by index (ie; row numbers) can be helpful in showing missing variables
    • plot(finMod$residuals) = plot the residuals against index (row number)
    • if there’s a trend/pattern in the residuals, it is highly likely that another variable (such as age/time) should be included.
      • residuals should not have relationship to index
# plot residual by index
plot(finMod$residuals,pch=19,cex=0.5)

  • plotting predicted values in test set versus truth in test set
pred <- predict(modFit, testing)
qplot(wage, pred, colour=year, data=testing)

  • ideally the predicted and true values would be very close to each other, falling essentially on the 45 degree line where wage is exactly equal to the prediction
    • in the test set, you can explore and try to identify trends you might have missed, for example by coloring by the year the data were collected, as a way of exploring how the model might have broken down
    • keep in mind that if you do this sort of exploration in the test set, you cannot go back and re-update your model on the training set, because that would be using the test set to rebuild your predictors; this is more of a post-mortem on your analysis, a way to judge whether it worked or not
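  • one way to quantify how well the model carried over is the RMSE on the test set (a small sketch using pred from above)
# root mean squared error of the wage predictions on the test set
sqrt(mean((pred - testing$wage)^2))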

11 Prediction with Trees

11.1 Basic Algorithm

  1. start with all variables in one group
  2. find the variable that best splits the outcomes into two groups
  3. divide data into two groups (leaves) based on the split performed (node)
  4. within each split, find variables to split the groups again
  5. continue this process until all groups are sufficiently small/homogeneous/“pure”

11.2 Measures of Impurity (Reference)

\[\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in \text{group } m} \mathbb{1}(y_i = k)\]

  • \(\hat p_{mk}\) is the probability of the objects in group \(m\) to take on the classification \(k\)
  • \(N_m\) is the size of the group

  • Misclassification Error \[ 1 - \hat{p}_{m~k(m)}\] where \(k(m)\) is the most common classification/group
    • 0 = perfect purity (no classification error)
    • 0.5 = no purity
      • Note: it is not 1 here because when \(\hat{p}_{m~k(m)} < 0.5\), i.e. there is no predominant classification for the objects, the group should be further subdivided until there is a majority
  • Gini Index \[ \sum_{k \neq k'} \hat{p}_{mk} \times \hat{p}_{mk'} = \sum_{k=1}^K \hat{p}_{mk}(1-\hat{p}_{mk}) = 1 - \sum_{k=1}^K \hat{p}_{mk}^2\]
    • 0 = perfect purity
    • 0.5 = no purity
  • Deviance \[ -\sum_{k=1}^K \hat{p}_{mk} \log_e\hat{p}_{mk} \]
    • 0 = perfect purity
    • 1 = no purity
  • Information Gain \[ -\sum_{k=1}^K \hat{p}_{mk} \log_2\hat{p}_{mk} \]
    • 0 = perfect purity
    • 1 = no purity
  • Example

# set margin and seed
par(mar=c(1,1,1,1), mfrow = c(1, 2)); set.seed(1234);
# simulate data
x = rep(1:4,each=4); y = rep(1:4,4)
# plot first scenario
plot(x,y,xaxt="n",yaxt="n",cex=3,col=c(rep("blue",15),rep("red",1)),pch=19)
# plot second scenario
plot(x,y,xaxt="n",yaxt="n",cex=3,col=c(rep("blue",8),rep("red",8)),pch=19)

  • left graph (splits the data well)
    • Misclassification Rate: \(\frac{1}{16} = 0.06\)
    • Gini: \(1 - [(\frac{1}{16})^2 + (\frac{15}{16})^2] = 0.12\)
    • Information: \(-[\frac{1}{16} \times \log_2 (\frac{1}{16})+ \frac{15}{16} \times \log_2(\frac{15}{16})] = 0.34\)
  • right graph (splits the data not well)
    • Misclassification: \(\frac{8}{16} = 0.5\)
    • Gini: \(1 - [(\frac{8}{16})^2 + (\frac{8}{16})^2] = 0.5\)
    • Information:\(-[\frac{8}{16} \times \log_2 (\frac{8}{16})+ \frac{8}{16} \times \log_2(\frac{8}{16})] = 1\)
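  • the same numbers can be checked with a small R sketch (p is the vector of class proportions in a node; the helper functions are just for illustration)
# impurity measures as functions of the class proportions in a node
misclass <- function(p) 1 - max(p)
gini <- function(p) 1 - sum(p^2)
info <- function(p) -sum(p * log2(p))
# left graph: 15 blue, 1 red -> approximately 0.06, 0.12, 0.34
sapply(list(misclass, gini, info), function(f) f(c(15,1)/16))
# right graph: 8 blue, 8 red -> 0.5, 0.5, 1
sapply(list(misclass, gini, info), function(f) f(c(8,8)/16))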

11.3 Constructing Trees with caret Package

  • tree<-train(y ~ ., data=train, method="rpart") = constructs trees based on the outcome and predictors
    • produces an rpart object, which can be used to predict new/test values
    • print(tree$finalModel) = returns text summary of all nodes/splits in the tree constructed
  • plot(tree$finalModel, uniform=TRUE) = plots the classification tree with all nodes/splits
    • [rattle package] fancyRpartPlot(tree$finalModel) = produces more readable, better formatted classification tree diagrams
    • each split will have the condition/node in bold and the leaves on the left and right sides following the “yes” or “no” indicators
      • “yes” \(\rightarrow\) go left
      • “no” \(\rightarrow\) go right
# load iris data set
data(iris)
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"
# want to predict species
table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
# create test/train data sets
inTrain <- createDataPartition(y=iris$Species,p=0.7, list=FALSE)
training <- iris[inTrain,]
testing <- iris[-inTrain,]
# plot iris widths vs sepal widths
qplot(Petal.Width, Sepal.Width, colour=Species, data=training)

# fit classification tree as a model
modFit <- train(Species ~ .,method="rpart",data=training)
# print the classification tree
print(modFit$finalModel)
## n= 105 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 105 70 setosa (0.33333333 0.33333333 0.33333333)  
##   2) Petal.Length< 2.45 35  0 setosa (1.00000000 0.00000000 0.00000000) *
##   3) Petal.Length>=2.45 70 35 versicolor (0.00000000 0.50000000 0.50000000)  
##     6) Petal.Width< 1.65 34  1 versicolor (0.00000000 0.97058824 0.02941176) *
##     7) Petal.Width>=1.65 36  2 virginica (0.00000000 0.05555556 0.94444444) *
# simple dendrogram
plot(modFit$finalModel, uniform=TRUE, main="Classification Tree")
text(modFit$finalModel, use.n=TRUE, all=TRUE, cex=0.8)

# plot the classification tree
rattle::fancyRpartPlot(modFit$finalModel)

# predict on test values
predict(modFit,newdata=testing)
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] setosa     setosa     setosa     versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor versicolor
## [25] versicolor versicolor versicolor versicolor versicolor versicolor
## [31] virginica  virginica  virginica  virginica  virginica  virginica 
## [37] versicolor virginica  virginica  versicolor versicolor virginica 
## [43] virginica  virginica  virginica 
## Levels: setosa versicolor virginica

12 Bagging

# load data
library(ElemStatLearn); data(ozone,package="ElemStatLearn")
# reorder rows based on ozone variable
ozone <- ozone[order(ozone$ozone),]
# want to predict temperature
head(ozone)
##     ozone radiation temperature wind
## 17      1         8          59  9.7
## 19      4        25          61  9.7
## 14      6        78          57 18.4
## 45      7        48          80 14.3
## 106     7        49          69 10.3
## 7       8        19          61 20.1
# create empty matrix
ll <- matrix(NA,nrow=10,ncol=155)
# iterate 10 times
for(i in 1:10){
  # resample with replacement
  ss <- sample(1:dim(ozone)[1],replace=T)
  # draw sample from the data and reorder rows based on ozone
  ozone0 <- ozone[ss,]; ozone0 <- ozone0[order(ozone0$ozone),]
  # fit loess function through data (similar to spline)
  loess0 <- loess(temperature ~ ozone,data=ozone0,span=0.2)
  # predict for every single loess curve the outcome for ozone values 1 to 155
  ll[i,] <- predict(loess0,newdata=data.frame(ozone=1:155))
}
# plot the data points
plot(ozone$ozone,ozone$temperature,pch=19,cex=0.5)
# plot each prediction model
for(i in 1:10){lines(1:155,ll[i,],col="grey",lwd=2)}
# plot the average in red
lines(1:155,apply(ll,2,mean),col="red",lwd=2)

12.1 Bagging Algorithms

  • in the caret package, there are three bagged-model options for the train function’s method argument: bagEarth, treebag, and bagFDA
  • alternatively, custom bag functions can be constructed (documentation)
    • bag(predictors, outcome, B=10, bagControl=bagControl(fit, predict, aggregate)) = define and execute custom bagging algorithm
      • B=10 = number of iterations/resampling to perform
      • bagControl() = controls for how the bagging should be executed
        • fit=ctreeBag$fit = the function that’s going to be applied to fit the model every time (could be a call to the train function in the caret package)
        • predict=ctreeBag$pred = the function used, given a particular model fit, to predict new values (could be a call to the predict function from a trained model)
        • aggregate=ctreeBag$aggregate = the way to put the predictions together (e.g. average the predictions across all the different replicated samples)
  • Example
# load relevant package and data
library(party); data(ozone,package="ElemStatLearn")
# reorder rows based on ozone variable
ozone <- ozone[order(ozone$ozone),]
# extract predictors
predictors <- data.frame(ozone=ozone$ozone)
# extract outcome
temperature <- ozone$temperature
# run bagging algorithm
treebag <- bag(predictors, temperature, B = 10,
# custom bagging function
bagControl = bagControl(fit = ctreeBag$fit,
predict = ctreeBag$pred,
aggregate = ctreeBag$aggregate))
# plot data points
plot(ozone$ozone,temperature,col='lightgrey',pch=19)
# plot the first fit
points(ozone$ozone,predict(treebag$fits[[1]]$fit,predictors),pch=19,col="red")
# plot the aggregated predictions
points(ozone$ozone,predict(treebag,predictors),pch=19,col="blue")

# look at the different parts of bagging
ctreeBag$fit
## function (x, y, ...) 
## {
##     loadNamespace("party")
##     data <- as.data.frame(x)
##     data$y <- y
##     party::ctree(y ~ ., data = data)
## }
## <environment: namespace:caret>
ctreeBag$pred
## function (object, x) 
## {
##     if (!is.data.frame(x)) 
##         x <- as.data.frame(x)
##     obsLevels <- levels(object@data@get("response")[, 1])
##     if (!is.null(obsLevels)) {
##         rawProbs <- party::treeresponse(object, x)
##         probMatrix <- matrix(unlist(rawProbs), ncol = length(obsLevels), 
##             byrow = TRUE)
##         out <- data.frame(probMatrix)
##         colnames(out) <- obsLevels
##         rownames(out) <- NULL
##     }
##     else out <- unlist(party::treeresponse(object, x))
##     out
## }
## <environment: namespace:caret>
ctreeBag$aggregate
## function (x, type = "class") 
## {
##     if (is.matrix(x[[1]]) | is.data.frame(x[[1]])) {
##         pooled <- x[[1]] & NA
##         classes <- colnames(pooled)
##         for (i in 1:ncol(pooled)) {
##             tmp <- lapply(x, function(y, col) y[, col], col = i)
##             tmp <- do.call("rbind", tmp)
##             pooled[, i] <- apply(tmp, 2, median)
##         }
##         if (type == "class") {
##             out <- factor(classes[apply(pooled, 1, which.max)], 
##                 levels = classes)
##         }
##         else out <- as.data.frame(pooled)
##     }
##     else {
##         x <- matrix(unlist(x), ncol = length(x))
##         out <- apply(x, 1, median)
##     }
##     out
## }
## <environment: namespace:caret>
  • parts of bagging:
    • ctreeBag$fit uses the ctree function to train a conditional inference tree
    • ctreeBag$pred takes in the object from the ctree model fit, and a new data set x to get a new prediction
    • ctreeBag$aggregate takes those predicted values and averages them together (or puts them together in some other way)

13 Random Forest

13.1 R Commands and Examples

  • rf<-train(outcome ~ ., data=train, method="rf", prox=TRUE, ntree=500) = runs random forest algorithm on the training data against all predictors
    • Note: the random forest algorithm bootstraps samples automatically by default, but it is still important to have a train/test/validation split to verify the accuracy of the model
    • prox=TRUE = the proximity measures between observations should be calculated (used in functions such as classCenter() to find center of groups)
      • rf$finalModel$prox = returns matrix of proximities
    • ntree=500 = specify number of trees that should be constructed
    • do.trace=TRUE = prints logs as the trees are being built \(\rightarrow\) useful for showing progress to the user
    • Note: randomForest() function can be used to perform random forest algorithm (syntax is the same as train) and is much faster
  • getTree(rf$finalModel, k=2) = return specific tree from random forest model
  • classCenter(predictors, outcome, proximity, nNbr) = computes the cluster centers using the nNbr nearest neighbors of the observations
    • prox = rf$finalModel$prox = proximity matrix from the random forest model
    • nNbr = number of nearest neighbors that should be used to compute cluster centers
  • predict(rf, test) = apply the random forest model to test data set
    • confusionMatrix(predictions, actualOutcome) = tabulates the predictions of the model against the truths
      • Note: this is generally done for the validation data set using the model built from training
  • Example
# load data
library(caret)
data(iris)
# create train/test data sets
inTrain <- createDataPartition(y=iris$Species,p=0.7, list=FALSE)
training <- iris[inTrain,]
testing <- iris[-inTrain,]
# apply random forest
modFit <- train(Species~ .,data=training,method="rf",prox=TRUE)
modFit
## Random Forest 
## 
## 105 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 105, 105, 105, 105, 105, 105, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9481790  0.9206752
##   3     0.9501546  0.9236665
##   4     0.9513311  0.9254397
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 4.
# better to use the randomForest function directly (faster); load the randomForest package first
library(randomForest)
modelRf <- randomForest(Species ~ ., data = training, importance = FALSE)
# return the second tree (first 6 rows)
# each of these rows corresponds to a particular split
head(getTree(modFit$finalModel,k=2))
##   left daughter right daughter split var split point status prediction
## 1             2              3         4        0.70      1          0
## 2             0              0         0        0.00     -1          1
## 3             4              5         3        4.75      1          0
## 4             6              7         4        1.65      1          0
## 5             8              9         4        1.70      1          0
## 6             0              0         0        0.00     -1          2
# compute cluster centers
irisP <- classCenter(training[,c(3,4)], training$Species, modFit$finalModel$prox)
# convert irisP to data frame and add Species column
irisP <- as.data.frame(irisP); irisP$Species <- rownames(irisP)
# plot data points
p <- qplot(Petal.Width, Petal.Length, col=Species,data=training)
# add the cluster centers
p + geom_point(aes(x=Petal.Width,y=Petal.Length,col=Species),size=5,shape=4,data=irisP)

# predict outcome for test data set using the random forest model
pred <- predict(modFit,testing)
# logic value for whether or not the rf algorithm predicted correctly
testing$predRight <- pred==testing$Species
# tabulate results
table(pred,testing$Species)
##             
## pred         setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         12         0
##   virginica       0          3        15
# plot data points with the incorrect classification highlighted
qplot(Petal.Width,Petal.Length,colour=predRight,data=testing,main="newdata Predictions")

14 Boosting

14.1 R Commands and Examples

  • gbm <- train(outcome ~ variables, method="gbm", data=train, verbose=F) = run boosting model on the given data
    • R has multiple boosting libraries (differences include the choice of basic classification functions and combination rules); common options for the method argument are gbm (boosting with trees), mboost (model-based boosting), ada (statistical boosting based on additive logistic regression), and gamBoost (boosting generalized additive models)
  • predict function can be used to apply the model to test data, similar to the rest of the algorithms in caret package

  • Example

# load data
library(ISLR); library(caret); library(ggplot2)
data(Wage)
# remove log wage variable (we are trying to predict wage)
Wage <- subset(Wage,select=-c(logwage))
# create train/test data sets
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]; testing <- Wage[-inTrain,]
# run the gbm model
modFit <- train(wage ~ ., method="gbm",data=training,verbose=FALSE)
# print model summary
print(modFit)
## Stochastic Gradient Boosting 
## 
## 2102 samples
##    9 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE      Rsquared   MAE     
##   1                   50      35.40267  0.3268457  24.04840
##   1                  100      34.89519  0.3332554  23.77177
##   1                  150      34.82822  0.3343683  23.80400
##   2                   50      34.86088  0.3363506  23.67719
##   2                  100      34.72696  0.3385710  23.66990
##   2                  150      34.75429  0.3373207  23.72571
##   3                   50      34.73929  0.3384432  23.63118
##   3                  100      34.78334  0.3364309  23.74651
##   3                  150      34.88069  0.3334293  23.86710
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were n.trees = 100,
##  interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
# plot the results (not bad, but still some variability)
qplot(predict(modFit, testing), wage, data=testing)
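  • to quantify out-of-sample performance, a test set RMSE can be computed from the same predictions; a minimal follow-up sketch reusing the modFit and testing objects created above (output not shown)
# predict wage on the test set with the boosted model
pred.gbm <- predict(modFit, testing)
# root mean squared error on the test set
sqrt(mean((pred.gbm - testing$wage)^2))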

15 Model Based Prediction

15.1 Linear Discriminant Analysis

  • to compare the probability for outcome \(Y = k\) versus probability for outcome \(Y = j\), we can look at the ratio of \[\frac{P(Y=k~|~X=x)}{P(Y=j~|~X=x)}\]
  • take the log of the ratio and apply Bayes’ Theorem, we get \[\log \frac{P(Y = k~|~X=x)}{P(Y = j~|~X=x)} = \log \frac{f_k(x)}{f_j(x)} + \log \frac{\pi_k}{\pi_j}\] which is effectively the log ratio of probability density functions \(f(x)\) plus the log ratio of prior probabilities \(\pi\)
    • Note: \(\log\) = monotone transformation, which means taking the \(\log\) of a quantity does not affect implication of the ratio since the \(\log(ratio)\) is directly correlated with ratio
  • if we substitute \(f_k(x)\) and \(f_j(x)\) with Gaussian probability density functions \[f(x) = \frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\] (with a common covariance matrix across classes), the ratio simplifies to \[\log \frac{P(Y = k~|~X=x)}{P(Y = j~|~X=x)} = \log \frac{\pi_k}{\pi_j} - \frac{1}{2}(\mu_k + \mu_j)^T \Sigma^{-1}(\mu_k - \mu_j) + x^T \Sigma^{-1} (\mu_k - \mu_j)\] where \(\Sigma^{-1}\) = inverse of the covariance matrix for the predictor variables, \(x^T\) = set of predictor variables, and \(\mu_k\) / \(\mu_j\) = means of classes \(k\) and \(j\) respectively
  • as annotated above, the log-ratio is effectively an equation of a line for a set of predictor variables \(x\) \(\rightarrow\) the first two terms are constants and the last term is in the form of \(X \beta\)
    • Note: the lines are also known as decision boundaries
  • therefore, we can classify values based on which side of the line the value is located (\(k\) vs \(j\))
  • discriminant functions are used to determine the value of \(k\); they take the form \[\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log(\pi_k)\]
    • plugging the set of predictor variables \(x^T\) into the discriminant function, we pick the value of \(k\) that produces the largest value of the discriminant function \(\delta_k(x)\)
    • that is how we decide on a class, or more formally, we decide on the class based on \(\hat{Y}(x) = \arg\max_k \delta_k(x)\)
    • we usually estimate parameters with maximum likelihood
  • Example
    • classify a group of values into 3 groups using 2 variables (\(x\), \(y\) coordinates)
      • basically, Gaussian distributions (the ovals) are fit to the data, and those distributions are used to draw decision boundaries that assign each point to the class with the highest posterior probability
        • each line splits the data into two groups \(\rightarrow\) 1 vs 2, 2 vs 3, 1 vs 3
        • each side of the line represents a region where the probability of one group (1, 2, or 3) is the highest
    • the result is represented in the following graph

  • R Commands
    • lda<-train(outcome ~ predictors, data=training, method="lda") = constructs a linear discriminant analysis model on the predictors with the provided training data
    • predict(lda, test) = applies the LDA model to test data and return the prediction results in data frame
    • Example: caret package
# load data
data(iris)
table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
# create training and test sets
inTrain <- createDataPartition(y=iris$Species,p=0.7, list=FALSE)
training <- iris[inTrain,]
testing <- iris[-inTrain,]
# run the linear discriminant analysis on training data
lda <- train(Species ~ .,data=training,method="lda")
# predict test outcomes using LDA model
pred.lda <- predict(lda,testing)
# print results
pred.lda
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] setosa     setosa     setosa     versicolor versicolor versicolor
## [19] versicolor versicolor versicolor virginica  versicolor versicolor
## [25] versicolor virginica  versicolor versicolor versicolor versicolor
## [31] virginica  virginica  virginica  virginica  virginica  virginica 
## [37] virginica  virginica  virginica  virginica  virginica  virginica 
## [43] virginica  virginica  virginica 
## Levels: setosa versicolor virginica

15.2 Naive Bayes

  • for predictors \(X_1,\ldots,X_m\), we want to model \(P(Y = k ~|~ X_1,\ldots,X_m)\)
  • by applying Bayes’ Theorem, we get \[P(Y = k ~|~ X_1,\ldots,X_m) = \frac{\pi_k P(X_1,\ldots,X_m~|~ Y=k)}{\sum_{\ell = 1}^K \pi_{\ell} P(X_1,\ldots,X_m ~|~ Y=\ell)}\]
  • since the denominator is just a sum (constant), we can rewrite the quantity as \[P(Y = k ~|~ X_1,\ldots,X_m) \propto \pi_k P(X_1,\ldots,X_m~|~ Y=k)\] or the probability is proportional to the numerator
    • Note: maximizing the numerator is the same as maximizing the ratio
  • \(\pi_k P(X_1,\ldots,X_m~|~ Y=k)\) can be rewritten as \[\begin{aligned} \pi_k P(X_1,\ldots,X_m~|~ Y=k) & = \pi_k P(X_1 ~|~ Y = k)P(X_2,\ldots,X_m ~|~ X_1,Y=k) \\ & = \pi_k P(X_1 ~|~ Y = k) P(X_2 ~|~ X_1, Y=k) P(X_3,\ldots,X_m ~|~ X_1,X_2, Y=k) \\ & = \pi_k P(X_1 ~|~ Y = k) P(X_2 ~|~ X_1, Y=k)\ldots P(X_m~|~X_1\ldots,X_{m-1},Y=k) \\ \end{aligned}\] where each variable has its own probability term that is conditional on all the other variables that have come before it
    • this is effectively indicating that each of the predictors may be dependent on other predictors
  • however, if we make the assumption that all predictor variables are independent of each other, the quantity can be simplified to \[ \pi_k P(X_1,\ldots,X_m~|~ Y=k) \approx \pi_k P(X_1 ~|~ Y = k) P(X_2 ~|~ Y = k)\ldots P(X_m ~|~ Y=k)\] which is effectively the product of the prior probability for \(k\) and the probabilities of the variables \(X_1,\ldots,X_m\) given that \(Y = k\) (or conditional on being in each class)
    • Note: the assumption is naive in that it is unlikely the predictors are completely independent of each other, but this model still produces useful results, particularly with a large number of binary/categorical variables
      • text and document classification usually require large quantities of binary and categorical features
  • R Commands
    • nb <- train(outcome ~ predictors, data=training, method="nb") = constructs a naive Bayes model on the predictors with the provided training data
    • predict(nb, test) = applies the naive Bayes model to test data and return the prediction results in data frame
    • Example: caret package
# using the same data from iris, run naive Bayes on training data
nb <- train(Species ~ ., data=training,method="nb")
# predict test outcomes using naive Bayes model
pred.nb <- predict(nb,testing)
# print results
pred.nb
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] setosa     setosa     setosa     versicolor versicolor versicolor
## [19] versicolor versicolor versicolor virginica  versicolor versicolor
## [25] versicolor versicolor versicolor versicolor versicolor versicolor
## [31] virginica  virginica  virginica  virginica  virginica  virginica 
## [37] virginica  virginica  virginica  virginica  virginica  virginica 
## [43] virginica  virginica  virginica 
## Levels: setosa versicolor virginica

15.3 Compare Results for LDA and Naive Bayes

  • linear discriminant analysis and naive Bayes generally produce similar results for small data sets
  • for our example data from the iris data set, we can compare the prediction results from the two models
# tabulate the prediction results from LDA and naive Bayes
table(pred.lda,pred.nb)
##             pred.nb
## pred.lda     setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         13         0
##   virginica       0          1        16
# create logical variable that returns TRUE for when predictions from the two models match
equalPredictions <- (pred.lda==pred.nb)
# plot the comparison
qplot(Petal.Width,Sepal.Width,colour=equalPredictions,data=testing)

  • as can be seen, the two approaches produce very similar predictions overall; only a few borderline points (between the classes) are classified differently

16 Model Selection

16.1 Example: Training vs Test Error for Combination of Predictors

  • Note: the code for this example comes from Hector Corrada Bravo’s Practical Machine Learning Course
  • to demonstrate the behavior of training and test errors, the prostate dataset from Elements of Statistical Learning is used
  • all combinations of predictors are used to produce prediction models, and the residual sum of squares (RSS) is calculated for all models on both the training and test sets
library(ElemStatLearn)
# load data and set seed
data(prostate); set.seed(1)
str(prostate)
## 'data.frame':    97 obs. of  10 variables:
##  $ lcavol : num  -0.58 -0.994 -0.511 -1.204 0.751 ...
##  $ lweight: num  2.77 3.32 2.69 3.28 3.43 ...
##  $ age    : int  50 58 74 58 62 50 64 58 47 63 ...
##  $ lbph   : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ svi    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lcp    : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ gleason: int  6 6 7 6 6 6 6 6 6 6 ...
##  $ pgg45  : int  0 0 20 0 0 0 0 0 0 0 ...
##  $ lpsa   : num  -0.431 -0.163 -0.163 -0.163 0.372 ...
##  $ train  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
# define outcome y and predictors x
covnames <- names(prostate[-(9:10)])
y <- prostate$lpsa; x <- prostate[,covnames]
# create test set predictors and outcomes
train.ind <- sample(nrow(prostate), ceiling(nrow(prostate)/2))
y.test <- prostate$lpsa[-train.ind]; x.test <- x[-train.ind,]
# create training set predictors and outcomes
y <- prostate$lpsa[train.ind]; x <- x[train.ind,]
# p = number of predictors
p <- length(covnames)
# initialize the list of residual sum squares
rss <- list()
# loop through each combination of predictors and build models
for (i in 1:p) {
# compute all combinations of i predictors out of p for i = 1...p (creates an i x choose(p,i) matrix)
Index <- combn(p,i)
# calculate residual sum squares of each combination of predictors
rss[[i]] <- apply(Index, 2, function(is) {
# take each combination (or column of Index matrix) and create formula for regression
form <- as.formula(paste("y~", paste(covnames[is], collapse="+"), sep=""))
# run linear regression with combination of predictors on training data
isfit <- lm(form, data=x)
# predict outcome for all training data points
yhat <- predict(isfit)
# calculate residual sum squares for predictions on training data
train.rss <- sum((y - yhat)^2)
# predict outcome for all test data points
yhat <- predict(isfit, newdata=x.test)
# calculate residual sum squares for predictions on test data
test.rss <- sum((y.test - yhat)^2)
# store each pair of training and test residual sum squares as a list
c(train.rss, test.rss)
})
}
# set up plot with labels, title, and proper x and y limits
plot(1:p, 1:p, type="n", ylim=range(unlist(rss)), xlim=c(0,p),
xlab="Number of Predictors", ylab="Residual Sum of Squares",
main="Prostate Cancer Data - Training vs Test RSS")
# add data points for training and test residual sum squares
for (i in 1:p) {
# plot training residual sum squares in blue
points(rep(i, ncol(rss[[i]])), rss[[i]][1, ], col="blue", cex = 0.5)
# plot test residual sum squares in red
points(rep(i, ncol(rss[[i]])), rss[[i]][2, ], col="red", cex = 0.5)
}
# find the minimum training RSS for each number of predictors
minrss <- sapply(rss, function(x) min(x[1,]))
# plot line through the minimum training RSS data points in blue
lines((1:p), minrss, col="blue", lwd=1.7)
# find the minimum test RSS for each number of predictors
minrss <- sapply(rss, function(x) min(x[2,]))
# plot line through the minimum test RSS data points in red
lines((1:p), minrss, col="red", lwd=1.7)
# add legend
legend("topright", c("Train", "Test"), col=c("blue", "red"), pch=1)

  • from the above, we can clearly see that the test RSS approaches its minimum at around 3 predictors and increases slightly as more predictors are used (over-fitting to the training set)

16.2 Split Samples

  • the best method to pick predictors/models is to split the given data into separate training/test/validation sets (a short splitting sketch follows the list below)
  • process
    1. divide data into training/test/validation sets (60 - 20 - 20 split)
    2. train all competing models on the training data
    3. apply the models on validation data and choose the best performing model
    4. you may re-perform the splitting and analysis several times (repeat steps 1 to 3), in order to get a better average estimate of what the out of sample error rate will be
    5. apply the overall best performing model on test set to appropriately assess performance on new data
  • common problems
    • limited data = if not enough data is available, it may not be possible to produce a good model fit after splitting the data into 3 sets
    • computational complexity = modeling with all subsets of models can be extremely taxing in terms of computations, especially when a large number of predictors are available
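  • a minimal sketch of step 1, reusing the Wage data from the boosting example above (the 60-20-20 proportions follow the guidelines; the second call splits the held-out 40% in half; the seed is arbitrary)
# load data and caret
library(ISLR); library(caret)
data(Wage)
set.seed(1234)
# 60% of the data for training
inTrain <- createDataPartition(y=Wage$wage, p=0.6, list=FALSE)
training <- Wage[inTrain,]
remainder <- Wage[-inTrain,]
# split the remaining 40% evenly into test and validation sets (20% each overall)
inTest <- createDataPartition(y=remainder$wage, p=0.5, list=FALSE)
testing <- remainder[inTest,]
validation <- remainder[-inTest,]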

16.3 Decompose Expected Prediction Error

  • the outcome \(Y_i\) can be modeled by \[Y_i = f(X_i) + \epsilon_i\] where \(\epsilon_i\) = error term
  • the expected prediction error is defined as \[EPE(\lambda) = E\left[\left(Y - \hat{f}_{\lambda}(X)\right)^2\right]\] where \(\lambda\) = specific set of tuning parameters
  • suppose \(\hat{f}_{\lambda}(x^*)\) is the estimate from the training data and we look at a new data point \(X = x^*\)
  • then the expected prediction error decomposes as follows \[\begin{aligned} E\left[\left(Y - \hat{f}_{\lambda}(x^*)\right)^2\right] & = \sigma^2 + \left(E[\hat{f}_{\lambda}(x^*)] - f(x^*)\right)^2 + E\left[\left(\hat{f}_{\lambda}(x^*) - E[\hat{f}_{\lambda}(x^*)]\right)^2\right]\\ & = \mbox{Irreducible Error} + \mbox{Bias}^2 + \mbox{Variance}\\ \end{aligned} \]
    • goal of prediction model = minimize overall expected prediction error
    • irreducible error = noise inherent to the data collection process \(\rightarrow\) cannot be reduced
    • bias/variance = can be traded off against each other to find the optimal model (least overall error); the small simulation after this list illustrates the trade-off
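  • Example: a small simulation (a toy setup, not from the course) that estimates the \(\mbox{Bias}^2\) and \(\mbox{Variance}\) terms at a single point \(x^* = 0.5\) by repeatedly refitting a flexible and a smooth loess model
# toy signal and irreducible noise
set.seed(1)
f <- function(x) sin(2*pi*x)
sigma <- 0.3
xstar <- 0.5
nSim <- 200
pred.flex <- pred.smooth <- rep(NA,nSim)
# repeatedly draw a training set and record each model's prediction at x*
for(i in 1:nSim){
x <- runif(50); y <- f(x) + rnorm(50,sd=sigma)
# flexible fit (low bias, high variance) vs smooth fit (higher bias, lower variance)
pred.flex[i] <- predict(loess(y ~ x, span=0.2), data.frame(x=xstar))
pred.smooth[i] <- predict(loess(y ~ x, span=0.9), data.frame(x=xstar))
}
# estimated bias^2 and variance of each estimator at x* (irreducible error is sigma^2)
c(bias2.flex=(mean(pred.flex)-f(xstar))^2, var.flex=var(pred.flex),
bias2.smooth=(mean(pred.smooth)-f(xstar))^2, var.smooth=var(pred.smooth))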

16.4 Hard Thresholding

  • Example
    • as we can see from the results below, some of the coefficients have values of NA
# load prostate data
data(prostate)
# create a small subset of 5 observations (fewer observations than the 10 variables)
small = prostate[1:5,]
# fit linear model
lm(lpsa ~ .,data =small)
## 
## Call:
## lm(formula = lpsa ~ ., data = small)
## 
## Coefficients:
## (Intercept)       lcavol      lweight          age         lbph  
##     9.60615      0.13901     -0.79142      0.09516           NA  
##         svi          lcp      gleason        pgg45    trainTRUE  
##          NA           NA     -2.08710           NA           NA
  • if there are more predictors than you have samples (high-dimensional data), linear regressions will only return coefficients for some of the variables because there’s not enough data to estimate the rest of the parameters
    • conceptually, this occurs because the design matrix that the model is based on cannot be inverted
    • Note: ridge regression can help address this problem
  • hard thresholding can help estimate the coefficients/model by taking subsets of predictors and building models
  • process
    • model the outcome as \[Y_i = f(X_i) + \epsilon_i\] where \(\epsilon_i\) = error term
    • assume the prediction estimate has a linear form \[\hat{f}_{\lambda}(x) = x'\beta\]
    • constrain only \(\lambda\) coefficients for the set of predictors \(x\) to be nonzero
    • after setting the value of \(\lambda\), compute models using all combinations of \(\lambda\) variables to find which \(p-\lambda\) variables’ coefficients should be set to zero (see the sketch after this list)
  • problem
    • computationally intensive
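  • Example: a minimal sketch of the process with \(\lambda = 2\) (an arbitrary choice), reusing the small and covnames objects from above; every 2-predictor model is fit and the subset with the smallest residual sum of squares is kept
# set lambda = number of nonzero coefficients allowed
lambda <- 2
# all combinations of lambda predictors
combos <- combn(covnames, lambda)
# fit each model on the small data set and compute its residual sum of squares
rss.small <- apply(combos, 2, function(vars){
fit <- lm(as.formula(paste("lpsa ~", paste(vars, collapse="+"))), data=small)
sum(resid(fit)^2)
})
# best subset of size lambda
combos[, which.min(rss.small)]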

16.5 Regularized Regression Concept (Resource)

  • regularized regression = fit a regression model and penalize (or shrink) the large coefficients in attempt to help with bias/variance trade-off or model selection
    • when running regressions unconstrained (without specifying any criteria for the coefficients \(\beta_j\)), the model may be susceptible to high variance (coefficients explode \(\rightarrow\) very large values) if there are variables that are highly correlated
    • controlling/regularizing coefficients may slightly increase bias (lose a bit of prediction capability) but will reduce variance and improve the prediction error
    • however, this approach may be very demanding computationally and generally does not perform as well as random forest and boosting
  • To control variance, the Penalized Residual Sum of Squares (PRSS) is calculated by adding a penalty term to the prediction squared error \[PRSS(\beta) = \sum_{j=1}^n \left(Y_j - \sum_{i=1}^m \beta_{i} X_{ij}\right)^2 + P(\lambda; \beta)\]
    • penalty shrinks coefficients if their values become too large
    • penalty term is generally used to reduce complexity and variance for the model, while respecting the structure of the data/relationship
  • Example: co-linear variables
    • given a linear model, \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\] where \(X_1\) and \(X_2\) are nearly perfectly correlated (co-linear)
    • the model can then be approximated by omitting \(X_2\), so the model becomes \[Y = \beta_0 + (\beta_1 + \beta_2)X_1 + \epsilon\]
    • with the above model, we can get a good estimate of \(Y\)
      • the estimate of \(Y\) will be biased
      • but the variance of the prediction may be reduced (the simulation sketch below illustrates this trade-off)
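  • Example: a short simulation (hypothetical data) of the co-linear case above; the full model has unstable individual coefficient estimates, while the reduced model estimates \(\beta_1 + \beta_2\) directly
# simulate two nearly perfectly correlated predictors
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd=0.01)
# true model uses both predictors (beta1 = 2, beta2 = 3)
y <- 1 + 2*x1 + 3*x2 + rnorm(n)
# full model: individual coefficients have very large standard errors
summary(lm(y ~ x1 + x2))$coefficients
# reduced model: the single coefficient is close to beta1 + beta2 = 5
summary(lm(y ~ x1))$coefficients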

16.6 Regularized Regression - Ridge Regression

  • the penalized residual sum of squares (PRSS) takes the form of \[PRSS(\beta_j) = \sum_{i=1}^N \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^p \beta_j^2\]
  • this is equivalent to solving the equation \[PRSS(\beta_j) = \sum_{i=1}^N \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \right)^2\] subject to constraint \(\sum_{j=1}^p \beta_j^2 \leq s\) where \(s\) is inversely proportional to \(\lambda\)
    • if the coefficients \(\beta_j\) are large in value, the term \(\sum_{j=1}^p \beta_j^2\) will cause the overall PRSS value to increase, leading to worse models
    • the presence of the term thus requires some of the coefficients to be small
  • inclusion of \(\lambda\) makes the problem non-singular even if \(X^TX\) is not invertible
    • this means that even in cases where there are more predictors than observations, the ridge regression model can still be fit
  • \(\lambda\) = tuning parameter
    • controls size of coefficients or the amount of regularization
    • as \(\lambda \rightarrow 0\), the result approaches the least square solution
    • as \(\lambda \rightarrow \infty\), all of the coefficients receive large penalties and the ridge coefficients \(\hat{\beta}_{\lambda=\infty}^{ridge}\) approach zero collectively
    • \(\lambda\) should be carefully chosen through cross-validation/other techniques to find the optimal tradeoff of bias for variance
    • Note: it is important to realize that all coefficients (though they may be shrunk to very small values) will still be included in the model when applying ridge regression
  • R Commands
    • (MASS package) ridge<-lm.ridge(outcome ~ predictors, data=training, lambda=5) = perform ridge regression with given outcome and predictors using the provided \(\lambda\) value
      • Note: the predictors are centered and scaled first before the regression is run
      • lambda=5 = tuning parameter
      • ridge$xm = returns column/predictor mean from the data
      • ridge$scale = returns the scaling performed on the predictors for the ridge regression
        • Note: all the variables are divided by the biased standard deviation \(\sqrt{\sum_i (X_i - \bar X)^2 / n}\)
      • ridge$coef = returns the (scaled) coefficients, \(\beta_j\), from the ridge regression
      • ridge$ym = return mean of outcome
    • (caret package) train(outcome ~ predictors, data=training, method="ridge", lambda=5) = perform ridge regression with given outcome and predictors
      • preProcess=c("center", "scale") = centers and scales the predictors before the model is built
        • Note: this is generally a good idea for building ridge regressions
      • lambda=5 = tuning parameter
    • (caret package) train(outcome ~ predictors, data=training, method="foba", lambda=5, k=4) = perform ridge regression with variable selection
      • lambda=5 = tuning parameter
      • k=4 = number of variables that should be retained
        • this means that length(predictors)-k variables will be eliminated
    • (caret package) predict(model,test) = use the model to predict on test set \(\rightarrow\) similar to all other caret algorithms
  • Example: ridge coefficient paths vs \(\lambda\)
    • using the same prostate dataset, we will run ridge regressions with different values of \(\lambda\) and find the optimum \(\lambda\) value that minimizes test RSS
# load MASS library for lm.ridge function
library(MASS)
# we will run the models for 10 different values from 0 to 50
lambdas <- seq(0,50,len=10)
# create empty vectors for training and test residual sum squares
M <- length(lambdas)
train.rss <- rep(0,M); test.rss <- rep(0,M)
# create empty matrix for coefficients/betas
betas <- matrix(0,ncol(x),M)
# run ridge regressions on the predictors for all values of lambda
for(i in 1:M){
# create formula for the ridge regression (outcome ~ predictors)
Formula <-as.formula(paste("y~",paste(covnames,collapse="+"),sep=""))
# run ridge regression with the ith lambda value
fit1 <- lm.ridge(Formula,data=x,lambda=lambdas[i])
# store the coefficients of the ridge regression as a column in betas matrix
betas[,i] <- fit1$coef
# center the training data based on column means
scaledX <- sweep(as.matrix(x),2,fit1$xm)
# scale the training data based on standard deviations
scaledX <- sweep(scaledX,2,fit1$scale,"/")
# calculate predictions for training data using the model coefficients
yhat <- scaledX%*%fit1$coef+fit1$ym
# calculate residual sum squares for training data
train.rss[i] <- sum((y - yhat)^2)
# center the test data based on column means
scaledX <- sweep(as.matrix(x.test),2,fit1$xm)
# scale the test data based on standard deviations
scaledX <- sweep(scaledX,2,fit1$scale,"/")
# calculate predictions for test data using the model coefficients
yhat <- scaledX%*%fit1$coef+fit1$ym
# calculate residual sum squares for test data
test.rss[i] <- sum((y.test - yhat)^2)
}
# plot the line for test RSS vs lambda in red
plot(lambdas,test.rss,type="l",col="red",lwd=2,ylab="RSS",ylim=range(train.rss,test.rss))
# plot the line for training RSS vs lambda in blue
lines(lambdas,train.rss,col="blue",lwd=2,lty=2)
# find the best lambda that minimizes test RSS
best.lambda <- lambdas[which.min(test.rss)]
# plot vertical line marking the best lambda
abline(v=best.lambda)
# add text label
text(best.lambda+5, min(test.rss)-2, paste("best lambda=", round(best.lambda, 2)))
# add legend to plot
legend(30,30,c("Train","Test"),col=c("blue","red"),lty=c(2,1))

# plot all coefficients vs values of lambda
plot(lambdas,betas[1,],ylim=range(betas),type="n",ylab="Coefficients")
for(i in 1:ncol(x))
lines(lambdas,betas[i,],type="b",lty=i,pch=as.character(i), cex = 0.5)
# add horizontal line for reference
abline(h=0)
# add legend to plot
legend("topright",covnames,pch=as.character(1:8), cex = 0.5)

16.7 Regularized Regression - LASSO Regression

  • LASSO (least absolute shrinkage and selection operator) was introduced by Tibshirani (Journal of the Royal Statistical Society 1996)
  • similar to ridge, with slightly different penalty term
  • the penalized residual sum of squares (PRSS) takes the lagrangian form of \[PRSS(\beta_j) = \sum_{i=1}^N \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^p |\beta_j|\]
  • this is equivalent to solving the equation \[PRSS(\beta_j) = \sum_{i=1}^N \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \right)^2\] subject to constraint \(\sum_{j=1}^p |\beta_j| \leq s\) where \(s\) is inversely proportional to \(\lambda\)
  • For orthonormal design matrices (not the norm!) this has a closed form solution \[\hat{\beta}_j = sign(\hat{\beta}_j^0)\left(|\hat{\beta}_j^0| - \gamma\right)^{+}\] but not in general
    • this basically says the lasso shrinks all of the coefficients and sets some of them to exactly 0
    • this approach is appealing because it both shrinks coefficients and, by setting some of them exactly to 0, performs variable selection in advance
  • \(\lambda\) = tuning parameter
    • controls size of coefficients or the amount of regularization
    • large values of \(\lambda\) will set some coefficient equal to zero
      • Note: LASSO effectively performs model selection (choose subset of predictors) while shrinking other coefficients, where as Ridge only shrinks the coefficients
  • R Commands
    • (lars package) lasso<-lars(as.matrix(x), y, type="lasso", trace=TRUE) = perform lasso regression by adding predictors one at a time (or setting some variables to 0)
      • Note: the predictors are centered and scaled first before the regression is run
      • as.matrix(x) = the predictors must be in matrix/dataframe format
      • trace=TRUE = prints progress of the lasso regression
      • lasso$lambda = return the \(\lambda\)s used for each step of the lasso regression
      • plot(lasso) = prints plot that shows the progression of the coefficients as they are set to zero one by one
      • predict.lars(lasso, test) = use the lasso model to predict on test data
        • Note: more information/documentation can be found in ?predict.lars
    • (lars package) cv.lars(as.matrix(x), y, K=10, type="lasso", trace=TRUE) = computes K-fold cross-validated mean squared prediction error for lasso regression
      • effectively the lars function is run K times, once per fold, to estimate the cross-validated mean squared prediction error
      • K=10 = create 10-fold cross validation
      • trace=TRUE = prints progress of the lasso regression
    • (elasticnet package) lasso<-enet(predictors, outcome, lambda = 0) = perform elastic net regression on given predictors and outcome
      • lambda=0 = default value for \(\lambda\)
        • Note: lasso regression is a special case of elastic net regression, and forcing lambda=0 tells the function to fit a lasso regression
      • plot(lasso) = prints plot that shows the progression of the coefficients as they are set to zero one by one
      • predict.enet(lasso, test) = use the lasso model to predict on test data
    • (caret package) train(outcome ~ predictors, data=training, method="lasso") = perform lasso regression with given outcome and predictors
      • Note: outcome and predictors must be in the same dataframe
      • preProcess=c("center", "scale") = centers and scales the predictors before the model is built
        • Note: this is generally a good idea for building lasso regressions
    • (caret package) train(outcome~predictors,data=train,method="relaxo",lambda=5,phi=0.3) = perform relaxed lasso regression on given predictors and outcome
      • lambda=5 = tuning parameter
      • phi=0.3 = relaxation parameter
        • phi=1 corresponds to the regular Lasso solutions
        • phi=0 computes the OLS estimates on the set of variables selected by the Lasso
    • (caret package) predict(model,test) = use the model to predict on test set \(\rightarrow\) similar to all other caret algorithms
  • Example: lars package
# load lars package
library(lars)
# perform lasso regression
lasso.fit <- lars(as.matrix(x), y, type="lasso", trace=TRUE)
# plot lasso regression model
plot(lasso.fit, breaks=FALSE, cex = 0.75)
# add legend
legend("topleft", covnames, pch=8, lty=1:length(covnames),
col=1:length(covnames), cex = 0.6)

# plots the cross validation curve
lasso.cv <- cv.lars(as.matrix(x), y, K=10, type="lasso", trace=TRUE)
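  • Example: caret package — a minimal sketch of the train interface described above (output not shown); the prostate training objects x and y are reused and combined into one data frame for the formula interface
# combine training predictors and outcome into a single data frame
trainDF <- data.frame(x, y=y)
# fit a lasso model through caret (requires the elasticnet package),
# centering and scaling the predictors first
lasso.caret <- train(y ~ ., data=trainDF, method="lasso",
preProcess=c("center","scale"))
# predict on the test set predictors
pred.lasso <- predict(lasso.caret, newdata=x.test)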

17 Combining Predictors

17.1 Example - Majority Vote

  • suppose we have 5 independent classifiers/models
  • each has 70% accuracy
  • majority vote accuracy = probability that a majority of the models are correct at the same time (given each classifier is correct 70% of the time)

\[\begin{aligned} \mbox{majority vote accuracy} & = p(3~correct,~2~wrong) + p(4~correct,~1~wrong) \\ &\qquad+ p(5~correct) \\ & = {5 \choose 3}(0.7)^3(0.3)^2 + {5 \choose 4}(0.7)^4(0.3)^1 + {5 \choose 5} (0.7)^5 \\ & = 10 \times(0.7)^3(0.3)^2 + 5 \times(0.7)^4(0.3)^1 + 1 \times (0.7)^5 \\ & = 83.7\% \\ \end{aligned}\]

  • with 101 independent classifiers, the majority vote accuracy rises to about 99.9% (both figures are verified in the quick check below)
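  • a quick numerical check of both figures using the binomial distribution (a minimal sketch)
# P(at least 3 of 5 independent 70%-accurate classifiers are correct)
sum(dbinom(3:5, size=5, prob=0.7))       # 0.83692, i.e. ~83.7%
# P(a majority of 101 independent classifiers are correct)
sum(dbinom(51:101, size=101, prob=0.7))  # very close to 1, consistent with the ~99.9% figure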

17.2 Example - Model Ensembling

library(ISLR); library(caret); library(ggplot2)
data(Wage)
Wage <- subset(Wage, select=-c(logwage))
# create a building data set and validation set
inBuild <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
validation <- Wage[-inBuild,]; buildData <- Wage[inBuild,]
inTrain <- createDataPartition(y=buildData$wage,p=0.7, list=FALSE)
training <- buildData[inTrain,]; testing <- buildData[-inTrain,]
# train the data using both glm and random forest models
glm.fit <- train(wage ~.,method="glm",data=training)
rf.fit <- train(wage ~.,method="rf",data=training,
trControl = trainControl(method="cv",number=3))
# use the models to predict the results on the testing set
glm.pred.test <- predict(glm.fit,testing)
rf.pred.test <- predict(rf.fit,testing)
# plot the predicted values from the two models
qplot(glm.pred.test,rf.pred.test,colour=wage,data=testing)

# combine the prediction results and the true results into new data frame
combinedTestData <- data.frame(glm.pred=glm.pred.test,
rf.pred = rf.pred.test,wage=testing$wage)
# run a Generalized Additive Model (gam) model on the combined test data
comb.fit <- train(wage ~.,method="gam",data=combinedTestData)
# use the resultant model to predict on the test set
comb.pred.test <- predict(comb.fit, combinedTestData)
# use the glm and rf models to predict results on the validation data set
glm.pred.val <- predict(glm.fit,validation)
rf.pred.val <- predict(rf.fit,validation)
# combine the results into data frame for the comb.fit
combinedValData <- data.frame(glm.pred=glm.pred.val,rf.pred=rf.pred.val)
# run the comb.fit on the combined validation data
comb.pred.val <- predict(comb.fit,combinedValData)
# tabulate the results - test data set RMSE Errors
rbind(test = c(glm = sqrt(mean((glm.pred.test-testing$wage)^2)),
rf = sqrt(mean((rf.pred.test-testing$wage)^2)),
combined = sqrt(mean((comb.pred.test-testing$wage)^2))),
# validation data set RMSE Errors
validation = c(sqrt(mean((glm.pred.val-validation$wage)^2)),
sqrt(mean((rf.pred.val-validation$wage)^2)),
sqrt(mean((comb.pred.val-validation$wage)^2))))
##                 glm       rf combined
## test       34.26616 35.58763 33.90359
## validation 35.40900 36.42254 35.25649

18 Forecasting

18.1 R Commands and Examples

  • quantmod package can be used to pull trading/price information for publicly traded stocks
    • getSymbols("TICKER", src="google", from=date, to=date) = gets the daily high/low/open/close price and volume information for the specified stock ticker
      • returns the data in a data frame under the stock ticker’s name
      • "TICKER" = ticker of the stock you are attempting to pull information for
      • src="google" = get price/volume information from Google finance
        • default source of information is Yahoo Finance
      • from and to = from and to dates for the price/volume information
        • both arguments must be specified with date objects
      • Note: more information about how to use getSymbols can be found in the documentation ?getSymbols
    • to.monthly(GOOG) = converts stock data to monthly time series from daily data
      • the function aggregates the open/close/high/low/volume information for each day into monthly data
      • GOOG = data frame returned from getSymbols function
      • Note: ?to.period contains documentation for converting time series to OHLC (open high low close) series
    • googOpen<-Op(GOOG) = returns the opening price from the stock data frame
      • Cl(), Hi(), Lo() = returns the close, high and low price from the stock data frame
    • ts(googOpen, frequency=12) = convert data to a time series with frequency observations per time unit
      • frequency=12 = number of observations per unit time (12 in this case because there are 12 months in each year \(\rightarrow\) converts data into yearly time series)
  • decompose(ts) = decomposes time series into trend, seasonal, and irregular components by using moving averages
    • ts = time series object
  • window(ts, start=1, end=6) = subsets the time series at the specified starting and ending points
    • start and end arguments must correspond to the time unit rather than the index
      • for instance, if the ts is a yearly series (frequency = 12), start/end should correspond to the row numbers or year (each year has 12 observations corresponding to the months)
      • c(1, 7) can be used to specify the element of a particular year (in this case, July of the first year/row)
      • Note: you can use 9.5 or any decimal as values for start/end, and the closest element (June of the 9th year in this case) will be used
      • Note: end=9-0.01 can be used as a shortcut to specify “up to 9”, since end=9 would include the first element of the 9th row
  • forecast package can be used for forecasting time series data
    • ma(ts, order=3) = calculates the simple moving average for the order specified
      • order=3 = order of moving average smoother, effectively the number of values that should be used to calculate the moving average
    • ets(train, model="MMM") = runs exponential smoothing model on training data
      • model = "MMM" = method used to create exponential smoothing
        • Note: more information can be found at ?ets and the corresponding model chart is here
    • forecast(ts) = performs a forecast on the specified time series and returns 5 columns: the point forecasts and the lower/upper bounds of the 80 and 95 percent prediction intervals
      • plot(forecast) = plots the forecast object, which includes the training data, forecast values for test periods, as well as the 80 and 95 percent confidence interval regions
    • accuracy(forecast, testData) = returns the accuracy metrics (RMSE, etc.) for the forecast model
  • the Quandl package is also used for finance-related data and predictions
  • Example: decomposed time series
# load quantmod package
library(quantmod);
# specify to and from dates
from.dat <- as.Date("01/01/00", format="%m/%d/%y")
to.dat <- as.Date("3/2/15", format="%m/%d/%y")
# get data for AAPL from Google Finance for the specified dates
getSymbols("AAPL", src="google", from = from.dat, to = to.dat)
## [1] "AAPL"
# convert the retrieved daily data to monthly data
mAAPL <- to.monthly(AAPL)
# extract the closing price and convert it to yearly time series (12 observations per year)
ts <- ts(Cl(mAAPL), frequency = 12)
# plot the decomposed parts of the time series
plot(decompose(ts),xlab="Years")

  • Example: forecast
# load forecast library
library(forecast)
# find the number of rows (years)
rows <- ceiling(length(ts)/12)
# use 90% of the data to create training set
ts.train <- window(ts, start = 1, end = floor(rows*.9)-0.01)
# use the rest of data to create test set
ts.test <- window(ts, start = floor(rows*.9))
# show time series on the training set
ts.train
##      Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
## 1   3.71  4.09  4.85  4.43  3.00  3.74  3.63  4.35  1.84  1.40  1.18  1.06
## 2   1.54  1.30  1.58  1.82  1.42  1.66  1.34  1.33  1.11  1.25  1.52  1.56
## 3   1.77  1.55  1.69  1.73  1.66  1.27  1.09  1.05  1.04  1.15  1.11  1.02
## 4   1.03  1.07  1.01  1.02  1.28  1.36  1.51  1.62  1.48  1.64  1.49  1.53
## 5   1.61  1.71  1.93  1.84  2.00  2.32  2.31  2.46  2.77  3.74  4.79  4.58
## 6   5.49  6.41  5.95  5.15  5.68  5.26  6.09  6.70  7.66  8.23  9.69 10.27
## 7  10.79  9.78  8.96 10.06  8.54  8.18  9.71  9.69 11.00 11.58 13.09 12.12
## 8  12.25 12.09 13.27 14.26 17.31 17.43 18.82 19.78 21.92 27.14 26.03 28.30
## 9  19.34 17.86 20.50 24.85 26.96 23.92 22.71 24.22 16.24 15.37 13.24 12.19
## 10 12.88 12.76 15.02 17.98 19.40 20.35 23.34 24.03 26.48 26.93 28.56 30.10
## 11 27.44 29.23 33.57 37.30 36.70 35.93 36.75 34.73 40.54 43.00 44.45 46.08
## 12 48.47 50.46 49.79 50.02 49.69 47.95 55.78 54.98 54.47 57.83 54.60 57.86
## 13 65.21 77.49 85.65 83.43 82.53 83.43 87.25 95.03 95.30 85.05 83.61 76.02
# plot the training set
plot(ts.train)
# add the moving average in red
lines(ma(ts.train,order=3),col="red")

# compute the exponential smoothing average
ets <- ets(ts.train,model="MMM")
# construct a forecasting model using the exponential smoothing function
fcast <- forecast(ets)
# plot forecast and add actual data in red
plot(fcast); lines(ts.test,col="red")

# print the accuracy of the forecast model
accuracy(fcast,ts.test)
##                     ME      RMSE       MAE        MPE     MAPE      MASE
## Training set 0.3738734  2.576067  1.438703  0.2377946 10.38942 0.1759433
## Test set     0.7292502 18.507031 15.347659 -3.6663982 19.01227 1.8769111
##                   ACF1 Theil's U
## Training set 0.1026281        NA
## Test set     0.8661382  3.200942

19 Unsupervised Prediction

19.1 R Commands and Examples

  • kmeans(data, centers=3) = can be used to perform clustering from the provided data
    • centers=3 = controls the number of clusters the algorithm should aim to divide the data into
  • cl_predict function in clue package provides similar functionality
# load relevant packages (caret for createDataPartition, gridExtra for grid.arrange)
library(caret); library(ggplot2); library(gridExtra)
# load iris data
data(iris)
# create training and test sets
inTrain <- createDataPartition(y=iris$Species,p=0.7, list=FALSE)
training <- iris[inTrain,]; testing <- iris[-inTrain,]
# perform k-means clustering for the data without the Species information
# Species = what the true clusters are
kMeans1 <- kmeans(subset(training,select=-c(Species)),centers=3)
# add clusters as new variable to training set
training$clusters <- as.factor(kMeans1$cluster)
# plot clusters vs Species classification
p1 <- qplot(Petal.Width,Petal.Length,colour=clusters,data=training) +
ggtitle("Clusters Classification")
p2 <- qplot(Petal.Width,Petal.Length,colour=Species,data=training) +
ggtitle("Species Classification (Truth)")
grid.arrange(p1, p2, ncol = 2)

  • as we can see, there are three clear groups that emerge from the data
    • this is fairly close to the actual results from Species
    • we can compare the results from the clustering and Species classification by tabulating the values
      • however in reality we don’t know the names of the species
# tabulate the results from clustering and actual species
table(kMeans1$cluster,training$Species)
##    
##     setosa versicolor virginica
##   1      0         33         4
##   2      0          2        31
##   3     35          0         0
  • with the clusters determined, the training data can be trained on all predictors with the clusters from k-means as outcome
# build classification trees using the k-means clusters
clustering <- train(clusters ~.,data=subset(training,select=-c(Species)),method="rpart")
  • we can compare the prediction results on training set vs truth
# tabulate the prediction results on training set vs truth
table(predict(clustering,training),training$Species)
##    
##     setosa versicolor virginica
##   1      0         31         1
##   2      0          4        34
##   3     35          0         0
  • similarly, we can compare the prediction results on test set vs truth
# tabulate the prediction results on test set vs truth
table(predict(clustering,testing),testing$Species)
##    
##     setosa versicolor virginica
##   1      0         13         0
##   2      0          2        15
##   3     15          0         0
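  • the cl_predict function from the clue package (mentioned above) can assign new observations to the existing k-means clusters directly; a minimal sketch, assuming the default cl_predict method for kmeans objects
# load clue package
library(clue)
# assign each test observation to the closest k-means cluster
testClusters <- cl_predict(kMeans1, newdata=subset(testing, select=-c(Species)))
# compare the assigned clusters against the true species
table(testClusters, testing$Species)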

20 Multicore Parallel Processing

20.1 Parallel Processing using doMC package

  • many of the algorithms in the caret package are computationally intensive
  • since most of the modern machines have multiple cores on their CPUs, it is often wise to enable multicore parallel processing to expedite the computations
  • doMC package is recommended to be used for caret computations (reference)
    • doMC::registerDoMC(cores=4) = registers 4 cores for R to utilize
    • the number of cores you should specify depends on the CPU on your computer (system information usually contains the number of cores)
      • it’s also possible to find the number of cores by directly searching for your CPU model number on the Internet
    • Note: once registered, you should see in your task manager/activity monitor that 4 “R Session” processes appear when you run your code
  • unfortunately, the doMC package used in the caret parallel processing documentation is not available for Microsoft Windows versions of R (the next section presents a cross-platform approach using the parallel and doParallel packages); a minimal doMC sketch follows
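  • a minimal sketch of the doMC workflow (Linux/macOS only; the core count of 4 is an assumption that should match your machine)
# load doMC and register 4 cores (Linux/macOS only)
library(doMC)
registerDoMC(cores=4)
# subsequent caret::train() calls will use the registered cores,
# as long as allowParallel=TRUE (the default) in trainControl
modFit <- train(Species ~ ., data=iris, method="rf")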

20.2 Parallel Implementation of Random Forest in caret using parallel package

  • the process for executing a random forest model (or any other model) in caret::train() is relatively straightforward, and includes the following steps:
  1. Configure parallel processing
  2. Configure trainControl object
  3. Develop training model
  4. De-register parallel processing cluster

Prerequisite: Selecting a Machine Learning Problem

  • For the purpose of illustrating the syntax required for parallel processing, we’ll use the Sonar data set
library(mlbench)
data(Sonar)
library(caret)
set.seed(95014)
# create training & testing data sets
inTraining <- createDataPartition(Sonar$Class, p = .75, list=FALSE)
training <- Sonar[inTraining,]
testing <- Sonar[-inTraining,]
# set up training run for x / y syntax because model format performs poorly
x <- training[,-61]
y <- training[,61]

Step 1: Configure parallel processing

  • Parallel processing in caret can be accomplished with the parallel and doParallel packages. The following code loads the required libraries (note, these libraries also depend on the iterators and foreach libraries).
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)

Step 2: Configure trainControl object

  • The most critical arguments for the trainControl function are the resampling method method, the number that specifies the quantity of folds for k-fold cross-validation, and allowParallel which tells caret to use the cluster that we’ve registered in the previous step.
fitControl <- trainControl(method = "cv",
number = 10,
allowParallel = TRUE)

Step 3: Develop training model

  • Next, we use caret::train() to train the model, using the trainControl() object that we just created.
fit <- train(x, y, method="rf", trControl = fitControl)

Step 4: De-register parallel processing cluster

  • After processing the data, we explicitly shut down the cluster by calling the stopCluster() and registerDoSEQ() functions. The registerDoSEQ() call is required to force R to return to single-threaded processing.
stopCluster(cluster)
registerDoSEQ()
  • At this point we have a trained model in the fit object, and can take a number of steps to evaluate the suitability of this model, including accuracy and a confusion matrix that is based on comparing the modeled data to the held out folds.
fit
fit$resample
confusionMatrix.train(fit)